<a href="https://colab.research.google.com/github/CometSplit/DS2500/blob/main/Series.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Based in part on materials by Prof. Alex Lex.

## The Pandas Library: Series

Pandas is a popular library for manipulating vectors, tables, and time series. We will frequently use pandas data structures instead of the built-in Python data structures, as they provide much richer functionality. Pandas is also **fast**, which makes working with large datasets easier. Check out the official pandas website at [http://pandas.pydata.org/](http://pandas.pydata.org/).

This tutorial is partially based on the [excellent book by Matt Harrison](https://www.amazon.com/Learning-Pandas-Library-Analysis-Visualization-ebook/dp/B01GIE03GW/).

Pandas provides two main data structures:

 * the **series**, which represents a single column of data, similar to a Python list
 * the **data frame**, which represents a table made up of multiple series

Older versions of pandas also offered the **panel**, which represented multiple data frames. It has since been removed from the library, so we'll work only with series and data frames.

To make pandas available, we'll import the module into this notebook. It is customary to import pandas as `pd`:

In [None]:
import pandas as pd

Series are the most fundamental data structure in pandas. Let's create two simple series based on lists:

In [None]:

bands = pd.Series(["Stones", "Beatles", "Zeppelin", "Pink Floyd"])
bands

In [None]:
founded = pd.Series([1962, 1960, 1968, 1965])
founded

When we output these objects we can see an index, also called an axis, which by default is an integer sequence starting at 0, and the associated values.

| Index | Value |
| - | - |
| 0 | Stones |
| 1 | Beatles |
| 2 | Zeppelin |
| 3 | Pink Floyd |

Pandas also tells us the data type of the values: `object` for the first series (which in this case holds Python strings) and `int64` (a 64-bit integer) for the second.

Notice that `int64` is not a built-in Python data type but a fixed-width 64-bit integer, which, unlike a Python integer, can overflow!
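To make the overflow concrete, here is a minimal sketch, assuming a default NumPy-backed `int64` series (the variable names are illustrative and not part of the tutorial's data). Adding 1 to the largest 64-bit integer silently wraps around, whereas a plain Python integer simply keeps growing.

```python
import pandas as pd

# Largest value that fits into a signed 64-bit integer
int64_max = 2**63 - 1

big = pd.Series([int64_max])      # inferred dtype is int64
wrapped = big + pd.Series([1])    # element-wise addition in 64-bit arithmetic

print(big.dtype)         # int64
print(wrapped.iloc[0])   # wrapped around to the most negative int64 value
print(int64_max + 1)     # a plain Python int does not overflow
```

For the founding years used in this notebook this is never an issue, but it is worth keeping in mind when working with very large counts or products.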

We can also use other data types as indices, in which case the series behaves a lot like a dictionary:

In [None]:
# the data is the first parameter, the index is given by the index keyword
bands_founded = pd.Series([1962, 1960, 1968, 1965, 2012],
                          index=["Stones", "Beatles", "Zeppelin", "Pink Floyd", "Pink Floyd"],
                          name="Bands founded")
bands_founded

| Index | Value |
| - | - |
| Stones | 1962 |
| Beatles | 1960 |
| Zeppelin | 1968 |
| Pink Floyd | 1965 |
| Pink Floyd | 2012 |

Here we see something interesting: we've used the same index label (Pink Floyd) twice, once for the original founding of the band, and once for the reunion starting in 2012. Also, the order of the entries is preserved.

A series is, so to speak, both a list and a dictionary!

We can access the values of a series through its `values` attribute.

In [None]:
bands.values

And we can look at how the index is composed:

In [None]:
bands.index

What we see here is that this isn't an explicit list of labels, but rather a `RangeIndex` that describes the labels by a rule (start, stop, and step)!
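As a small sketch of what this rule-based index looks like (using an illustrative copy of the series, here called `bands_demo`), the default index stores only a start, a stop, and a step, and generates the labels on demand:

```python
import pandas as pd

bands_demo = pd.Series(["Stones", "Beatles", "Zeppelin", "Pink Floyd"])
idx = bands_demo.index

print(type(idx).__name__)              # RangeIndex
print(idx.start, idx.stop, idx.step)   # 0 4 1
print(list(idx))                       # [0, 1, 2, 3]
```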

Let's compare this to the index where we used explicit labels:

In [None]:
bands_founded.index

We can access individual entries as we'd access an array or a dictionary:

In [None]:
bands[0]

In [None]:
bands_founded["Pink Floyd"]

There is also a method for looking up a value:

In [None]:
bands_founded.get("Stones")

Note that label lookups go through a hash-based index, so, like a dictionary lookup, they don't require scanning all entries the way searching for a value in a list does.

This also works with lists of labels, in which case the result is a series rather than a single value.

In [None]:
bands_founded.get(["Stones", "Beatles"])

Notice that when a label occurs more than once in the index, we don't get a single value, as in the above cases, but instead get another series back:

In [None]:
bands_founded["Pink Floyd"]

Series also have indexers for label-based access: [`loc`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)

In [None]:
# And one more way for looking up a value:
bands_founded.loc["Stones"]
# this is equivalent to
# bands_founded["Stones"]

Related to the `loc` indexer is the [`iloc`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html) indexer. Unlike `loc`, `iloc` operates purely on position, not on index labels:

In [None]:
bands_founded.iloc[1]

These ways of accessing slices of a dataset (`loc`, `iloc`) will make more sense when we use data frames instead of series: in data frames, `loc` and `iloc` operate on the rows, whereas square brackets operate on the columns.
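As a brief preview of that difference, here is a sketch with a small, hypothetical data frame (the `df` name and the `country` column are illustrative and not part of this notebook's data):

```python
import pandas as pd

# Hypothetical two-row data frame, only for illustration
df = pd.DataFrame(
    {"founded": [1962, 1960], "country": ["UK", "UK"]},
    index=["Stones", "Beatles"],
)

row_by_label = df.loc["Beatles"]   # a row, selected by index label
row_by_position = df.iloc[0]       # a row, selected by position
column = df["founded"]             # a column, selected by name

print(row_by_label)
print(row_by_position)
print(column)
```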

### Iterating

Iteration works as you would expect:

In [None]:
for band in bands:
    print(band)

In [None]:
for band, founded in bands_founded.items():
    print(band + ", " + str(founded))

### Updating
Updating works largely as expected. However, you have to be careful when updating series with duplicate indices:

In [None]:
bands[2] = "The Doors"
bands

We can add a new item by directly assigning it to a new index.

In [None]:
bands[4] = "Zeppelin"
bands

Note that the indices don't have to be sequential.

In [None]:
bands[17] = "The Who"
bands

When we update based on an index that occurs more than once, all instances are updated:

In [None]:
bands_founded["Pink Floyd"] = 2015
bands_founded

A way to update a specific entry when an index label is used multiple times is the `iloc` indexer, which sets values based purely on position. However, all of this is rather ugly.

In [None]:
bands_founded.iloc[4] = 1965
bands_founded

In [None]:
bands.index

### Deleting

Deleting is rarely done with pandas data structures; instead, filters and masks are used. Deleting by index label is possible, though:

In [None]:
del bands_founded["Stones"]
bands_founded
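For comparison, here is a sketch of the non-destructive alternatives, using an illustrative series named `founded_demo`. Both `drop` and a boolean mask on the index return a new series and leave the original untouched:

```python
import pandas as pd

founded_demo = pd.Series([1962, 1960, 1968],
                         index=["Stones", "Beatles", "Zeppelin"])

# drop returns a new series without the given label
without_stones = founded_demo.drop("Stones")

# a boolean mask on the index achieves the same result
also_without = founded_demo[founded_demo.index != "Stones"]

print(without_stones)
print(also_without)
print(len(founded_demo))   # the original still has all three entries
```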

### Indexing and slicing

Indexing and slicing work largely like in normal Python, but instead of directly using bracket notation, it is recommended to use `iloc` for indexing by position and `loc` for indexing by label.

In [None]:
# slicing by position
bands_founded.iloc[1:3]

When slicing by index, the last value specified is *included*, which differs from regular Python slicing behavior.

In [None]:
# slicing by index
bands_founded.loc["Zeppelin" : "Pink Floyd"]

In [None]:
# Note that index 17 is included
bands.loc[1:17]
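The two slicing conventions can be compared side by side in a short sketch (using an illustrative series with unique labels, here called `founded_demo`):

```python
import pandas as pd

founded_demo = pd.Series([1962, 1960, 1968, 1965],
                         index=["Stones", "Beatles", "Zeppelin", "Pink Floyd"])

# Positional slice: positions 1 and 2, position 3 is excluded
by_position = founded_demo.iloc[1:3]

# Label slice: the end label "Pink Floyd" is included
by_label = founded_demo.loc["Beatles":"Pink Floyd"]

print(list(by_position.index))   # ['Beatles', 'Zeppelin']
print(list(by_label.index))      # ['Beatles', 'Zeppelin', 'Pink Floyd']
```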

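For series (not for data frames), bracket notation with labels behaves like `loc`. Integer slices in brackets are a caveat: pandas interprets them by position, like `iloc`, so `bands[2:17]` includes the entry labeled 17 only because the series has fewer than 17 entries: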

In [None]:
bands[2:17]

Both `iloc` and `loc` can be used with lists of positions or labels, which isn't possible with plain Python lists:

In [None]:
bands_founded.iloc[[0,3]]

In [None]:
bands_founded.loc[["Beatles", "Pink Floyd"]]

And, all these variants can also be used with boolean arrays, which we will soon find out to be very helpful:

In [None]:
bands_founded

In [None]:
bands_founded.loc[[True, False, False, True]]

### Masking and Filtering

With pandas we can create boolean arrays that we can use to mask and filter a dataset. In the following expression, we'll create a new series that has "True" for every band formed after 1964:

In [None]:
mask = bands_founded > 1964
mask

This uses a technique called **broadcasting**. We can use broadcasting with various operations:

In [None]:
# Not particularly useful for this dataset.
founding_months = bands_founded * 12
founding_months
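To illustrate broadcasting a bit further, here is a minimal sketch (with an illustrative series `founded_demo`): a single scalar is combined with every element of the series, for arithmetic as well as for comparisons, and the result keeps the original index.

```python
import pandas as pd

founded_demo = pd.Series([1962, 1960, 1968],
                         index=["Stones", "Beatles", "Zeppelin"])

# The scalar 2000 is combined with each element
age_in_2000 = 2000 - founded_demo

# Comparisons broadcast the same way and yield a boolean series
founded_1965_or_later = founded_demo >= 1965

print(age_in_2000)
print(founded_1965_or_later)
```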

We can use a boolean mask to filter a series, as we've seen before:

In [None]:
# applying the mask to the original array
# note that almost all of those operations return a new copy and don't modify in place
bands_founded[mask]

The short form here would be:

In [None]:
# return bands_founded where the values in the series (years) are > 1964
bands_founded[bands_founded > 1964]

In [None]:
bands_founded

## Exploring a Series

There are various ways we can explore a series. We can count the number of non-null values:

In [None]:
numbers = pd.Series([1962, 1960, 1968, 1965, 2012, None, 2016])
numbers.count()

In [None]:
numbers

We can get the sum, mean, median of a series:

In [None]:
numbers.sum()

In [None]:
numbers.mean()

In [None]:
numbers.median()

We can also get an overview of the statistical properties of a series:

In [None]:
numbers.describe()

Note that None/NaN values are ignored here. We can drop all NaN values if we desire:

In [None]:
numbers = numbers.dropna()
numbers
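To make the handling of missing values concrete, here is a small sketch with illustrative data (`numbers_demo` is not part of the notebook). A `None` in numeric data is stored as `NaN`, the aggregations skip it by default, and `skipna=False` turns that behavior off:

```python
import pandas as pd

numbers_demo = pd.Series([1.0, 2.0, None, 3.0])   # None is stored as NaN

print(numbers_demo.count())              # 3 non-null values
print(numbers_demo.sum())                # 6.0, NaN is skipped
print(numbers_demo.mean())               # 2.0, NaN is skipped
print(numbers_demo.mean(skipna=False))   # nan, because NaN is not skipped
print(len(numbers_demo.dropna()))        # 3 entries remain after dropping NaN
```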

In [None]:
bands_founded

`describe()` also works for non-numerical data. Of course, we get different measures:

In [None]:
bands.describe()

Other useful methods are asking for a specific quantile, the minimum, the maximum, etc.

In [None]:
numbers.quantile(0.25)

In [None]:
numbers.max()

In [None]:
numbers.min()

## Sorting

We can sort a series:

In [None]:
numbers.sort_values()

In [None]:
sorted_numbers = numbers.sort_values(ascending=False)
sorted_numbers

Note that each value keeps its original index label. We can **reset the indices**:

In [None]:
# If we don't set drop=True, the previous indices are preserved in a separate column
sorted_numbers = sorted_numbers.reset_index(drop=True)
sorted_numbers

We can also sort by the index:

In [None]:
# mix up the indices first
new_sorted_numbers = numbers.sort_values()
print(new_sorted_numbers)
new_sorted_numbers.sort_index()

## Applying a Function

Often, we will want to apply a function to all values of a series. We can do that with the `map` method.

This is an incredibly powerful concept that you can use to modify series in sophisticated ways.
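As a minimal sketch of mapping a function over a series (the helper `to_decade` and the series `years` are illustrative names, not part of the original notebook), we can pass either a named function or a lambda:

```python
import pandas as pd

years = pd.Series([1962, 1960, 1968, 1965])

def to_decade(year):
    """Return the first year of the decade a year belongs to."""
    return year // 10 * 10

# map calls the function once per value and returns a new series
decades = years.map(to_decade)

# a lambda works the same way
labels = years.map(lambda year: f"founded in {year}")

print(decades)
print(labels)
```

The original series is not modified; `map` returns a new series with the same index.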

Another way to use `map` is to pass in a dictionary, which replaces each value with the entry stored under the matching key; values without a matching key become `NaN`:

In [None]:
new_sorted_numbers.map({1965:1945, 2012:1999, 1968:"What"})

## Conclusion

Series (and data frames) are incredibly powerful. We've only covered a small part of the features here. Make sure to also check out resources such as the [10 minutes to pandas guide](http://pandas.pydata.org/pandas-docs/stable/10min.html).

### Exercise: Pandas Series

Create a new pandas series with the lists given below that contain NFL team names and the number of Super Bowl titles they won. Use the names as indices, the wins as the data.

 * Once the series is created, sort it alphabetically by index.
 * Print an overview of the statistical properties of the series. What's the mean number of wins?
 * Filter out all teams that have won fewer than four Super Bowl titles.
 * A football team has 45 players. Update the series so that instead of the number of titles, it reflects the number of Super Bowl rings given to the players.
 * Assume that each ring costs USD 30,000. Update the series so that it contains a string of the dollar amount including the \$ sign. For the Steelers, for example, this would correspond to:
 ```
 Pittsburgh Steelers             $ 8100000
 ```


In [None]:
teams = ["Pittsburgh Steelers",
"Dallas Cowboys",
"San Francisco 49ers",
"New England Patriots",
"Green Bay Packers",
"New York Giants",
"Denver Broncos",
"Oakland/Los Angeles Raiders",
"Washington Redskins",
"Miami Dolphins",
"Baltimore/Indianapolis Colts",
"Baltimore Ravens"]
wins = [6,5,5,4,4,4,3,3,3,2,2,2]

**Take a poll here**: [poll1](https://PollEv.com/marinakogan791)