# Pandas Series

## References

[pandas website](https://pandas.pydata.org/)

The site includes a link to the PDF of *pandas: powerful Python data analysis toolkit*, a free online alternative to *Python for Data Analysis* by Wes McKinney.

## Setup

This is the standard import statement for pandas:

In [None]:
import pandas as pd

# Series

Two commonly used one-dimensional data structures built into basic Python are dictionaries and lists.

Here's an example of a dictionary:

In [None]:
states_dict = {'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'}
print(states_dict)

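Dictionary items are addressable by key, but not by integer position. (Since Python 3.7, dictionaries preserve insertion order, but they still cannot be indexed by position.) In the second cell below, `2` is treated as a key, and because no such key exists, a `KeyError` is raised.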

In [None]:
print(states_dict['TN'])

In [None]:
print(states_dict[2])  # raises KeyError: 2 is not a key in the dictionary

Lists are ordered, but don't have labels.

In [None]:
animal_list = ['lizard', 'spider', 'worm', 'bee']
print(animal_list)

List items are addressable by integer index, but not by label.

In [None]:
print(animal_list[2])

In [None]:
print(animal_list['reptile'])  # raises TypeError: list indices must be integers or slices

Series are one-dimensional pandas data structures that are sort of a hybrid of dictionaries and lists. They are ordered, but they also are labeled.

We can create an instance of a Series by passing a dictionary as an argument into `pd.Series()`:

In [None]:
states_series = pd.Series(states_dict)
print(states_series)

When a Series is displayed, the label index is shown on the left and the Series items are shown on the right.

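We can refer to items in a Series either by position (using an integer index) or by name (using the label index for the item). Integer indexing is zero-based, as with everything else in Python.

With plain square brackets, the label lookup is the dependable one. Older versions of pandas also accept an integer inside plain square brackets and treat it as a position when the labels are not integers, but that fallback is deprecated in recent pandas releases. For that reason, the cell below uses the explicit `.iloc[]` form (described next) for the positional lookup.

In [None]:
print(states_series.iloc[2])
print(states_series['TN'])

Series item 2 is the third item in the Series since we start counting with zero.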

There is an alternate way of referring to Series items by position that makes the indexing system explicit. `.loc[]` locates items by label index, and `.iloc[]` locates items by integer index. WARNING: One gotcha here is that the "i" in "iloc" should be thought of as referring to "integer", NOT "index". In pandas, when the term "index" is used by itself, it refers to the label index, not the integer index.

Specifying a single index in `.loc[]` or `.iloc[]` returns a single value from the Series. In this case the values are strings, so the type of the returned value is `str`.

In [None]:
print(states_series.iloc[2])
print(states_series.loc['TN'])
print(type(states_series.loc['TN']))
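
The two indexing systems can be translated into each other. The following is a minimal sketch, assuming the same `states_series` as above (rebuilt here so the cell runs on its own); it uses the `.get_loc()` method of the label index to find the position of a label, and plain indexing on `.index` to find the label at a position.

```python
import pandas as pd

states_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'})

# Translate a label into its integer position
position = states_series.index.get_loc('TN')
print(position)

# Translate an integer position into its label
label = states_series.index[2]
print(label)

# Both routes reach the same item
print(states_series.iloc[position] == states_series.loc['TN'])
```

This works cleanly here because every label is unique. With duplicate labels, `.get_loc()` may return something other than a single integer.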

## Why do we care about Series?

There doesn't immediately seem to be a big advantage to using a pandas Series over generic Python data structures like lists and dictionaries. However, Series can be important in two ways:

1. Using them can make operations much faster because they are built from NumPy arrays.
2. Their behavior is similar to the more commonly used pandas DataFrames, but Series are simpler and therefore easier to understand.

We can see how Series are constructed by dissecting them into their values and label index parts.

In [None]:
print(states_series.values)
print(type(states_series.values))
print(states_series.index)
print(type(states_series.index))

From this we see that the values in the series are actually a NumPy array.

Both the values and the label indices are iterable, so we can use them in `for` loops.

In [None]:
for state_name in states_series.values:
    print(state_name)
    print(type(state_name))

In [None]:
for state_abbrev in states_series.index:
    print(state_abbrev)
    print(type(state_abbrev))
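
If both the label and the value are needed in the same loop, one option is the `.items()` method, which yields (label, value) pairs. This is a small sketch that rebuilds `states_series` so it can run on its own.

```python
import pandas as pd

states_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'})

# Each iteration provides the label and its value together
for state_abbrev, state_name in states_series.items():
    print(state_abbrev, '->', state_name)
```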

However, because Series are based on NumPy arrays, they can also be used in vectorized operations, so looping over them with `for` is usually unnecessary (and slower).

In [None]:
# Perform manipulation using a for loop
result = []
for state_name in states_series.values:
    result.append(state_name + ' State')
print(result)
print(type(result))

In [None]:
# Perform manipulation as a vectorized operation
result = states_series + ' State'
print(result)
print(type(result))

Notice that the result of the vectorized operation is another Series that shares the same respective label indices as the original Series.
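
Vectorized operations are not limited to strings. Here is a sketch with a numeric Series; the trail names and distances are made up for illustration and are not part of the original example.

```python
import pandas as pd

# Hypothetical distances in kilometers, for illustration only
distance_km = pd.Series({'trail_a': 1.5, 'trail_b': 4.0, 'trail_c': 2.5})

# Arithmetic is applied to every value at once; the labels are preserved
distance_m = distance_km * 1000
print(distance_m)

# Summary methods operate on the whole Series without an explicit loop
print(distance_km.sum())
```

As with the string example, the result of the arithmetic is a new Series with the same label index as the original.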

## Slicing a Series

We can use `.iloc[]` to slice a Series by integer position in a manner similar to the way we slice lists in basic Python. We use a range of positions, separated by a colon, inside of the square brackets instead of a single number as we did to retrieve one value. As with lists, the last item in the range is not included. That is, the range `1:4` includes items 1 through 3, but not 4.

The resulting slice is also a Series, and the label indices remain associated with the sliced values in the resulting Series.


In [None]:
print(states_series)
print()
print(states_series.iloc[1:4])

We can explicitly specify the items we want by using a list of integer indices inside the square brackets. The order of the items in the resulting Series will be the order they were specified in the list.

In [None]:
print(states_series.iloc[ [1, 3, 0] ])

Series can also be sliced by a range of their labels using `.loc[]`. Unlike slicing with `.iloc[]`, a label slice includes the item at the end of the range.

In [None]:
print(states_series)
print()
print(states_series.loc['TN':'PA'])

Notice that the end of the range (`PA`) was included in the slice.
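
To make the difference concrete, here is a short side-by-side sketch (rebuilding `states_series` so it runs on its own). `TN` is at position 1 and `PA` is at position 3, so the two slices below name the same endpoints but return different numbers of items.

```python
import pandas as pd

states_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'})

# Positional slice: the end position (3) is excluded
by_position = states_series.iloc[1:3]
print(list(by_position.index))

# Label slice: the end label ('PA') is included
by_label = states_series.loc['TN':'PA']
print(list(by_label.index))
```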

We can also slice by specifying a list of label indices to be included.

In [None]:
print(states_series.loc[ ['TN', 'AK', 'OH'] ])

## Slicing by condition

Evaluating a condition involving the Series is a vectorized operation, so the result will be a Series of booleans indicating whether each of the values in the Series meets the condition. The following is a vectorized boolean operation on the `states_series` Series.

In [None]:
states_series == 'Tennessee'

Some googling and StackOverflow may be required to find out how to set up the condition you want. For example, here's how to check if a value starts with a particular string.

In [None]:
states_series.str.startswith('A')

In addition to explicitly locating items in the Series using `.loc[]`, we can also locate items by condition. If we insert a conditional expression inside the square brackets, the resulting slice will include any items for which the condition is `True`.

In [None]:
print(states_series.loc[states_series == 'Tennessee'])

In [None]:
print(states_series.loc[states_series.str.startswith('A')])

Notice that the result of slicing by condition using `.loc[]` is a series, even if the result contains only a single item.
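
Conditions can also be combined. The sketch below assumes the same `states_series` and uses the element-wise operators `&` (and), `|` (or), and `~` (not), which work on boolean Series. Python's own `and`, `or`, and `not` keywords do not work element-wise here. The variable names are only for illustration.

```python
import pandas as pd

states_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'})

starts_with_a = states_series.str.startswith('A')
ends_with_a = states_series.str.endswith('a')

# Both conditions must be True
both = states_series.loc[starts_with_a & ends_with_a]
print(both)

# Either condition may be True; parentheses are needed around comparisons
either = states_series.loc[starts_with_a | (states_series == 'Ohio')]
print(either)

# Invert a condition
not_a = states_series.loc[~starts_with_a]
print(not_a)
```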

# Slices vs. copies

One feature of Python that becomes important at this point is that the assignment operator does not actually make a copy of the object on the right. Instead, it assigns a name to that object.

In [None]:
first_number = 3
second_number = first_number
second_number = 6
print(first_number)

In the example above, we might think that changing the value of `second_number` leaves `first_number` unaffected because the assignment made a copy of `first_number`. But try this:

In [None]:
first_list = ['dog', 'cat', 'bird']
print('first list:', first_list)
second_list = first_list
second_list[1] = 'horse'
print('second list:', second_list)
print('first list now:', first_list)

What's going on here? After the assignment statement in the third line, when we change `second_list`, our changes also indirectly affect `first_list`. The reason is that in the third line the assignment statement applies an additional name to the list object, but it doesn't make a separate copy of the list. Both `first_list` and `second_list` refer to the same list object, so changing one of them also changes the other. Because lists are mutable (changeable) objects, 

```
second_list[1] = 'horse'
```

affects both lists.

In the earlier example, `first_number` is the name applied to an immutable (unchangeable) integer object (`3`).

```
second_number = first_number
```

applies a new name to the integer `3`. But 

```
second_number = 6
```

cannot change the integer object `3` into `6` because `3` is immutable. Rather, it re-assigns the name `second_number` to a different object, the integer `6`.

The important thing here is that assignment statements do not create a copy of mutable objects. Changing the object through the new name also changes what we see through the original name, unless the copying is made explicit, as in this example:

In [None]:
first_list = ['dog', 'cat', 'bird']
print('first list:', first_list)
second_list = first_list.copy()
second_list[1] = 'horse'
print('second list:', second_list)
print('first list now:', first_list)

In this example, the name `second_list` is explicitly assigned to a copy of `first_list` rather than `first_list` itself. So changes made to `second_list` have no effect on `first_list`.
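
One way to check whether two names refer to the same object is the `is` operator, which compares identity rather than content. This is a small sketch; the names `same_object` and `separate_copy` are only for illustration.

```python
first_list = ['dog', 'cat', 'bird']

same_object = first_list           # a second name for the same list
separate_copy = first_list.copy()  # a new list with the same items

print(same_object is first_list)    # True: one object, two names
print(separate_copy is first_list)  # False: two distinct objects
print(separate_copy == first_list)  # True: the contents are equal
```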

We have to be careful about this if we want to make modifications to slices. 

In [None]:
states_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'})
print(states_series)
print()
states_slice = states_series.loc['TN':'PA']
states_slice['AZ'] = 'The Grand Canyon State'
print(states_slice)
print()
print(states_series)

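In the example above, even though the modification was made on the slice, it can also affect the original Series, because the slice may be a view of the original data rather than a separate copy. (This depends on the pandas version: with Copy-on-Write, which is optional in pandas 2.x and planned as the default in pandas 3.0, the original is left unchanged.) To avoid depending on this behavior, we make the assignment to an explicit copy of the slice.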

In [None]:
states_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'})
print(states_series)
print()
states_slice = states_series.loc['TN':'PA'].copy()
states_slice['AZ'] = 'The Grand Canyon State'
print(states_slice)
print()
print(states_series)

## Making changes "stick"

When you perform a sort on a Python list using the `.sort()` method, nothing is returned. Rather, the sort is performed "in place" on the list, replacing the original list with the sorted one. 

In [None]:
day_list = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
day_list.sort() # Sort the list, nothing is returned (and therefore nothing is displayed when the cell is run)

In [None]:
# Display the current condition of the list (sorted=alphabetized)
day_list

You can sort a pandas Series by its values using `.sort_values()`. If the values are strings, they will be alphabetized. In contrast to a list sort, the `.sort_values()` method does return something: a new Series containing the items in sorted order. Because a value is being returned, the sorted Series is displayed below the cell when it is run.

In [None]:
states_series.sort_values()

However, the sorted Series was only displayed and not assigned to anything. The original Series was not changed, as we can see by displaying it:

In [None]:
states_series

To keep the change (i.e. the sort of the Series), we can save the sorted Series under a different name by assigning the returned object to the new name. Because `.sort_values()` returns a new object rather than a view of the original, changes we make to the sorted Series do not affect the original Series, and an explicit `.copy()` is not needed.

In [None]:
sorted_series = states_series.sort_values()
sorted_series

If we want the sorted series to replace the original one, re-assign the original name to the sorted series.

In [None]:
states_series = states_series.sort_values()
states_series

This pattern (reassigning the original name to the object returned by a method) is probably the most straightforward way to apply changes that we make to a pandas data object. It's useful if we add or delete rows or columns, sort, change values, etc.
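
The following sketch is one way to confirm that `.sort_values()` hands back a separate object and leaves the original alone. It rebuilds `states_series` so it can run on its own; the replacement value `'The Last Frontier'` is just an example.

```python
import pandas as pd

states_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'})

sorted_series = states_series.sort_values()

# The returned object is not the original object
print(sorted_series is states_series)

# The original keeps its order; the returned Series is alphabetized by value
print(list(states_series.index))
print(list(sorted_series.index))

# Changing the sorted Series does not change the original
sorted_series.loc['AK'] = 'The Last Frontier'
print(states_series.loc['AK'])
```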

Another option is to perform the operation using the `inplace=True` argument, rather than assigning the result to a different variable.

In [None]:
states_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'})
print(states_series)
print()
states_series.sort_values(inplace=True)
print(states_series)

Notice that when the `inplace=True` argument is used, the method does not return anything. Instead, it applies the change to the original object.

*Note: as of October 2022, there is some discussion about deprecating the* `inplace` *argument since it is considered to have some unintended pitfalls.*

There are several ways to change the type of sort. The keyword `ascending=False` can be used to reverse the direction of the sort (reverse alphabetical order for strings, largest to smallest for numbers).

In [None]:
states_series = pd.Series({'OH': 'Ohio', 'TN': 'Tennessee', 'AZ': 'Arizona', 'PA': 'Pennsylvania', 'AK': 'Alaska'})
print(states_series)
print()
print(states_series.sort_values(ascending=False))


The sort can also be performed on the label index rather than on the values if you use `.sort_index()`.

In [None]:
states_series = pd.Series({'AZ': 'Arizona', 'AR': 'Arkansas', 'AL': 'Alabama', 'IA': 'Iowa', 'ID': 'Idaho'})
print(states_series.sort_values())
print()
print(states_series.sort_index())
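
The sort order can also be customized with the `key` argument of `.sort_values()`, which is available in pandas 1.1 and later. The function passed as `key` receives the whole Series and should return a Series of the same length whose values are used for ordering. The sketch below sorts the state names by their length; sorting by length is just an illustration, and the order of names with equal length is not guaranteed by the default sort.

```python
import pandas as pd

states_series = pd.Series({'AZ': 'Arizona', 'AR': 'Arkansas', 'AL': 'Alabama', 'IA': 'Iowa', 'ID': 'Idaho'})

# Order the items by the number of characters in each name
by_length = states_series.sort_values(key=lambda names: names.str.len())
print(by_length)
```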


# Optional: Series without explicit label indices

If we construct a Series from a list rather than a dictionary, pandas will generate label indices from a sequence of integers starting with zero. 

In [None]:
animal_list = ['lizard', 'spider', 'worm', 'bee']
animal_series = pd.Series(animal_list)
print(animal_series)

At this point, the positional integer index and the label index for a particular item will be the same. So using `.loc[]` and `.iloc[]` will produce the same result.

In [None]:
print(animal_series.loc[2])
print(animal_series.iloc[2])

However, if we create a slice of the original Series, the positional integer indexing for the slice will begin with zero, but the label indices for items in the slice will be the same as in the original Series.

In [None]:
animal_slice = animal_series.iloc[2:]
print(animal_slice)

In [None]:
print(animal_slice.iloc[0])
print(animal_slice.loc[2])

To avoid this confusion, it's convenient to assign some kind of unique non-integer identifiers to the Series items. Here's an example:

In [None]:
id_animal_series = pd.Series(animal_list, index=['a', 'b', 'c', 'd'])
print(id_animal_series)

In [None]:
id_animal_slice = id_animal_series.iloc[2:]
print(id_animal_slice)

In [None]:
print(id_animal_slice.iloc[0])
print(id_animal_slice.loc['c'])
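
If integer labels are kept instead, another possible approach is to renumber the labels of a slice with `.reset_index(drop=True)`, so that labels and positions line up again. This is a sketch that rebuilds the integer-labeled Series; the name `renumbered` is only for illustration.

```python
import pandas as pd

animal_series = pd.Series(['lizard', 'spider', 'worm', 'bee'])
animal_slice = animal_series.iloc[2:]

# Replace the old labels (2, 3) with a fresh sequence starting at zero
renumbered = animal_slice.reset_index(drop=True)
print(renumbered)

# Label 0 and position 0 now refer to the same item
print(renumbered.loc[0])
print(renumbered.iloc[0])
```

The trade-off is that the renumbered labels no longer identify which items of the original Series the slice came from.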

# Practice exercises

The first cell queries Wikidata to get the labels of artworks from the Vanderbilt Fine Arts Gallery that have their images in Wikimedia Commons. It returns a list of URLs and a list of labels for the works.

**Note:** If you installed stand-alone Jupyter notebooks, you may need to install the `requests` module. If you are using an Anaconda installation or Colab, that module should already be installed.

In [None]:
import requests

def get_wikidata():
    query_string = '''select distinct ?qid ?label where {
  ?qid wdt:P195 wd:Q18563658. # must be in the collection Vanderbilt University Fine Arts Gallery
  ?qid wdt:P18 ?image. # item must be depicted in Wikimedia Commons
  ?qid rdfs:label ?label. # get the label
  filter(lang(?label)='en') # filter the labels to include only English
  }'''
    
    response = requests.get(
        'https://query.wikidata.org/sparql',
        params={'query': query_string},
        headers={'Accept': 'application/sparql-results+json'},
    )
    data = response.json()
    results = data['results']['bindings']
    
    urls = []
    labels = []
    for result in results:
        urls.append(result['qid']['value'])
        labels.append(result['label']['value'])
    return urls, labels

urls, labels = get_wikidata()

In [None]:
import pandas as pd
labels_series = pd.Series(labels, index=urls)
labels_series.head()