# The Pandas Series

Pandas provides a `Series` object that is very similar to a NumPy array, but with some additional functionality such as a labeled index. In this lesson, we will learn about pandas Series and how to work with them.

## Overview

A pandas Series object is a one-dimensional, labeled array made up of an index and data of a single data type. Unless I provide my own labels, the index is autogenerated as integers starting at 0.

A couple of important things to note about a Series:

- If I try to make a pandas Series using multiple data types like `int` and `str` values, the data will be stored with the generic `object` data type; the Series loses the optimized numeric behavior of an `int` Series.


- A pandas Series can be created in several ways; we will look at a few of these ways below. However, **it will most often be created by selecting a single column from a pandas DataFrame, in which case the Series retains the same index as the DataFrame.** We will dive into this in the next two lessons: DataFrames and Advanced DataFrames.
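As a small illustration of both points, the sketch below builds a mixed-type Series and then selects a column from a toy DataFrame. The names `mixed`, `df`, and `scores` and the sample data are made up for this example.

```python
import pandas as pd

# Mixing int and str values makes pandas fall back to the generic object dtype.
mixed = pd.Series([1, 'two', 3])
print(mixed.dtype)

# A hypothetical DataFrame with a custom (non-default) index.
df = pd.DataFrame(
    {'score': [88, 92, 79]},
    index=['ana', 'ben', 'cal'],
)

# Selecting a single column returns a Series that keeps the DataFrame's index.
scores = df['score']
print(type(scores))
print(scores.index.tolist())
```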

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

Convention is to import the `pandas` module with the alias `pd`.

## Create a Series

We can use the pandas Series constructor function to create a Series: 

- from a Python list or a NumPy array.

In [None]:
numbers_series = pd.Series([100, 43, 26, 17, 17])
type(numbers_series)

In [None]:
numbers_series

Notice what happens when we create a Series containing different data types:

In [None]:
# Illustrative values: an int, a str, and a float together produce dtype object.
pd.Series([3, 'two', 4.5])

In [None]:
letters_series = pd.Series(['a', 'e', 'h', 'd', 'b', 'z'])
letters_series

- from a Python dictionary.

In [None]:
labeled_series = pd.Series({'a' : 0, 'b' : 1.5, 'c' : 2, 'd': 3.5, 'e': 4, 'f': 5.5})
labeled_series
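When a Series is built from a dictionary, the keys become the index labels. A list can be given labels in a similar way through the constructor's `index` argument; the sketch below compares the default index with a custom one. The variable names and values are illustrative only.

```python
import pandas as pd

# Without an index argument, pandas generates integer labels starting at 0.
default_index = pd.Series([10, 20, 30])
print(default_index.index.tolist())   # [0, 1, 2]

# Passing index= attaches custom labels to the same values.
custom_index = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
print(custom_index['y'])              # 20
```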

## Vectorized Operations

Like NumPy arrays, pandas Series are vectorized by default. For example, we can use the basic arithmetic operators to manipulate every element in the Series at once.

In [None]:
# Add 1 to every element in the Series.
numbers_series + 1

In [None]:
# Divide every element in the Series by 2.
numbers_series / 2

Comparison operators also work:

In [None]:
# Compare every element to 17; the result is a Series of boolean values.
numbers_series == 17

In [None]:
# Check which elements are greater than 40.
numbers_series > 40

## Series Attributes

**Attributes** return useful information about a Series' properties; they don't perform operations or calculations with the Series. Attributes are easily accessible using dot notation like we will see in the examples below. *Jupyter Notebook allows you to quickly access a list of available attributes by pressing the tab key after the series name followed by a period or dot; this is called dot notation or attribute access.*

There are several components that make up a pandas Series, and I can easily access each component by using attributes.

### `.index`

- **The index** allows us to reference items in the Series. In our numbers_series, the index consists of the numbers 0-4.

In [None]:
# Access the index of the Series using dot notation.

numbers_series.index

### `.values`

- **The values** are my data.

In [None]:
# The values are stored in a NumPy array. Hello vectorized operations!

numbers_series.values

### `.dtype`

- **The dtype** is the data type of the elements in the Series. In our numbers_series,  the data type is `int64`; it was inferred from the data we used.

    Pandas has several main data types we will work with:

    - int: integer, whole number values
    - float: decimal numbers
    - bool: true or false values
    - object: strings
    - category: a fixed and limited set of string values
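To see dtype inference in action, the following sketch creates a few small Series and prints the dtype pandas chose for each. The variable names and values are illustrative; the exact integer width reported can vary by platform.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
floats = pd.Series([1.5, 2.0, 3.25])
bools = pd.Series([True, False, True])

# The category dtype is not inferred; it is requested explicitly with astype.
cats = pd.Series(['low', 'high', 'low']).astype('category')

for name, s in [('ints', ints), ('floats', floats), ('bools', bools), ('cats', cats)]:
    print(name, s.dtype)
```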

### `.name`

- **The name** is an optional human-friendly name for the Series.

Our Series doesn't have a name, but we can give it one:

In [None]:
# Assign a name to the Series; 'Numbers' is just an example.
numbers_series.name = 'Numbers'

In [None]:
numbers_series

### `.size`

- The `.size` attribute returns an int representing the number of rows in the Series. *NULL values are included.*

In [None]:
numbers_series.size

### `.shape`

- The `.shape` attribute returns a tuple representing the rows and columns when used on a two-dimensional structure like a DataFrame, but it can also be used on a Series to return its number of rows. *NULL values are included.*

In [None]:
numbers_series.shape
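The note that NULL values are included can be checked with a Series that contains a missing value. In this sketch (sample data made up for the example), `.size` and `.shape` count every row, while the `.count()` method skips the missing one.

```python
import numpy as np
import pandas as pd

with_missing = pd.Series([1.0, np.nan, 3.0])

print(with_missing.size)     # 3
print(with_missing.shape)    # (3,)
print(with_missing.count())  # 2
```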

## Series Methods

**Methods** called on a pandas Series usually return a new Series rather than changing the original. Some methods accept an `inplace` parameter, which defaults to `False`, so the original Series is not mutated unless I ask for it.

- If I want to save any manipulations or transformations I make on my Series, I can either assign the result to a variable or adjust the method's parameters.
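A minimal sketch of this behavior, using `sort_values` on a made-up Series: calling the method alone leaves the original untouched, and assigning the result is what keeps it.

```python
import pandas as pd

original = pd.Series([3, 1, 2])

# Calling the method without assignment leaves `original` unchanged.
original.sort_values()
print(original.tolist())       # [3, 1, 2]

# Assigning the result keeps the sorted Series under a new name.
sorted_series = original.sort_values()
print(sorted_series.tolist())  # [1, 2, 3]
```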

Series have a number of useful methods that we can use for various sorts of manipulations and transformations; let's look at a few.

### `.head`, `.tail`, `.sample`

- The `.head(n)` method returns the first n rows in the Series; `n = 5` by default. This method returns a new Series with the same indexing as the original Series. 


- The `.tail(n)` method returns the last n rows in the Series; `n = 5` by default. Increase or decrease your value for n to return more or fewer than 5 rows.


- The `.sample(n)` method returns a random sample of rows in the Series; `n = 1` by default. Again, the index is retained.

In [None]:
# Return the first 3 rows.
numbers_series.head(3)

In [None]:
# Return the last 3 rows.
numbers_series.tail(3)

In [None]:
numbers_series.sample()

### `.astype`

We can convert the data types of the values in our Series with the `.astype` method.

In [None]:
num_strings = pd.Series([3, 4, 5, 6]).astype('str')
num_strings

In [None]:
floats = pd.Series([3, 4, 5, 6]).astype('float')
floats

In [None]:
floats.astype('int')

### `.value_counts`

The `.value_counts()` method returns a new Series whose index holds the unique values from the original Series and whose values show how often each unique value appears in the original Series. *It's like performing a SQL `GROUP BY` with a `COUNT`.*


- This is an extremely useful method you will find yourself using often with Series containing object and category data types. 

In [None]:
pd.Series(['a', 'b', 'a', 'c', 'b', 'a', 'd', 'a']).value_counts()
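If proportions are more useful than raw counts, `value_counts` also accepts `normalize=True`. The sketch below reuses the same letters; the variable names are illustrative.

```python
import pandas as pd

letters = pd.Series(['a', 'b', 'a', 'c', 'b', 'a', 'd', 'a'])

# Raw counts: 'a' appears 4 times out of 8.
counts = letters.value_counts()
print(counts['a'])       # 4

# normalize=True returns each count divided by the total number of values.
proportions = letters.value_counts(normalize=True)
print(proportions['a'])  # 0.5
```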

### `.describe`

The `.describe` method returns a Series of descriptive statistics on a pandas Series. The information it returns depends on the data type of the elements in the Series.

In [None]:
numbers_series.describe()
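To illustrate how the output depends on the data type, this sketch compares the statistics `.describe` reports for a numeric Series and for a Series of strings. The sample data and variable names are made up for the example.

```python
import pandas as pd

numeric = pd.Series([100, 43, 26, 17, 17])
text = pd.Series(['a', 'b', 'a', 'c'])

numeric_summary = numeric.describe()
text_summary = text.describe()

print(numeric_summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(text_summary.index.tolist())
# ['count', 'unique', 'top', 'freq']
```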

**More Descriptive Statistics Methods**

Pandas has a number of methods that can be used to view summary statistics
about our data. The table below ([taken from
here](https://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics))
provides a summary of some of the most commonly used methods.

| Function   | Description                                |
|----------  |-------------                               |
| `count`    | Number of non-NA observations              |
| `sum`      | Sum of values                              |
| `mean`     | Mean of values                             |
| `median`   | Arithmetic median of values                |
| `min`      | Minimum                                    |
| `max`      | Maximum                                    |
| `mode`     | Mode                                       |
| `abs`      | Absolute Value                             |
| `std`      | Bessel-corrected sample standard deviation |
| `quantile` | Sample quantile (value at %)               |

In [None]:
{
    'count': numbers_series.count(),
    'sum': numbers_series.sum(),
    'mean': numbers_series.mean()
}

### `.nlargest`, `.nsmallest`

These methods allow me to return the n largest or n smallest values from a pandas Series. *I can set the `keep` parameter to `first`, `last`, or `all` to deal with duplicate largest or smallest values; this is quite handy.*

The default argument for keep is shown below.

```python
Series.nlargest(n=5, keep='first')
Series.nsmallest(n=5, keep='first')
```

In [None]:
numbers_series.nlargest(n=1)

In [None]:
# If I want to return all of the lowest values, not just the first instance.

numbers_series.nsmallest(n=1, keep='all')

### `.sort_values`, `.sort_index`

These are handy methods that allow you to sort a Series by its values or by its index, respectively, in ascending or descending order.

- I can use the parameters for these methods to customize my sorts to meet my needs.

In [None]:
letters_series.sort_values()

In [None]:
# The Series values retain their index from the original Series.

letters_series.sort_values(ascending=False)

In [None]:
# I can reset the index using this parameter if that meets my needs.

letters_series.sort_values(ignore_index=True)

In [None]:
# I can also sort by index values.

labeled_series.sort_index(ascending=False)

## Exercises Part I

Make a file named `pandas_series.py` or `pandas_series.ipynb` for the following exercises.

Use pandas to create a Series named fruits from the following list:

        ["kiwi", "mango", "strawberry", "pineapple", "gala apple", "honeycrisp apple", "tomato", "watermelon", "honeydew", "kiwi", "kiwi", "kiwi", "mango", "blueberry", "blackberry", "gooseberry", "papaya"]
        
Use Series attributes and methods to explore your fruits Series.

In [None]:
fruits = ["kiwi", "mango", "strawberry", "pineapple", "gala apple", "honeycrisp apple", "tomato", "watermelon", "honeydew", "kiwi", "kiwi", "kiwi", "mango", "blueberry", "blackberry", "gooseberry", "papaya"]

In [None]:
fruits = pd.Series(fruits)

In [None]:
type(fruits)

1. Determine the number of elements in fruits.
    

2. Output only the index from fruits.
    

3. Output only the values from fruits.
    

4. Confirm the data type of the values in fruits.
    

5. Output only the first five values from fruits. Output the last three values. Output two random values from fruits.
    

6. Run the `.describe()` on fruits to see what information it returns when called on a  Series with string values.

7. Run the code necessary to produce only the unique string values from fruits.

8. Determine how many times each unique string value occurs in fruits.

9. Determine the string value that occurs most frequently in fruits.

10. Determine the string value that occurs least frequently in fruits.

___

## Indexing and Subsetting

- This is where the pandas index shines; we can select subsets of our data using index labels, index position, or boolean sequences (list, array, Series).


- I can also pass a sequence of boolean values to the indexing operator, `[]`; that sequence could be a list or array, but it can also be another pandas Series **if the index of the boolean Series matches the original Series**.

In [None]:
numbers_series

In [None]:
# I can see that my condition is being met by the values at index 0 and index 1.

bools = numbers_series > 40
bools

In [None]:
# I pass my boolean mask to the original Series to return the values that meet the condition.

numbers_series[bools]

In [None]:
# I can simply pass my conditional expression into the indexing operator, too.

numbers_series[numbers_series > 40]
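The cells above cover boolean selection. For selection by index label or by position, one common approach is the `.loc` and `.iloc` indexers, which are not otherwise covered in this lesson; the sketch below uses a small made-up labeled Series.

```python
import pandas as pd

labeled = pd.Series({'a': 0, 'b': 1.5, 'c': 2, 'd': 3.5})

# .loc selects by index label.
print(labeled.loc['b'])               # 1.5

# .iloc selects by integer position.
print(labeled.iloc[0])                # 0.0

# Label slices include the end label; position slices exclude the end position.
print(labeled.loc['b':'c'].tolist())  # [1.5, 2.0]
print(labeled.iloc[1:3].tolist())     # [1.5, 2.0]
```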

## More Series Attributes

### `.str`

In addition to vectorized arithmetic operations, pandas also provides us with a way to vectorize string manipulation. Once we access the `.str` attribute, we can apply a string method to each string value in a Series. *Performing string manipulation like this does not mutate my original Series; I have to assign my manipulation to a variable if I want to keep it.*

For example, we can call the `.lower` method, which will convert each string value in the string_series to lowercase.

In [None]:
string_series = pd.Series(['Hello', 'CodeuP', 'StUDenTs'])
string_series

In [None]:
string_series.str.lower()

In [None]:
string_series.str.replace('e', '_')

In [None]:
string_series

In [None]:
# Since each method returns a Series, I can use method chaining like this.

string_series.str.lower().str.replace('e', '_')

In [None]:
string_series.str.lower().str.startswith('h')

In [None]:
# I can even use method chaining and indexing!

string_series[string_series.str.lower().str.startswith('h')]

In [None]:
# Notice my original string_series is not mutated. 

string_series

## More Series Methods

### `.any`, `.all`

We can use the `.any` method to check if any value in the series is `True`, and `.all`, to check if every value in a Series is `True`. Both methods return a boolean value denoting whether the condition is met.

For example, we could check to see if there are any negative values in a Series like this:

In [None]:
numbers_series < 0

In [None]:
(numbers_series < 0).any()

In [None]:
(numbers_series < 0).all()

We could check if all the numbers are positive like this:

In [None]:
(numbers_series > 0).any()

In [None]:
(numbers_series > 0).all()

In [None]:
# Check whether the Series contains any missing values.
numbers_series.isna().any()

### `.isin`

The `.isin` method can be used to tell whether each element in a Series matches an element in a passed sequence of values. For example, if we have a Series of letters, we could use `.isin` to tell whether each letter is a vowel.

In [None]:
# Create a list of vowels.

vowels = list('aeiou')
vowels

In [None]:
# Create a list of letters.

letters = list('abcdefghijk')
letters

In [None]:
# Construct a pandas Series from my list of letters.

letters_series = pd.Series(letters)
letters_series

In [None]:
# Use .isin to check if each element in letters_series matches an element in my list of vowels.

letters_series.isin(vowels)

In [None]:
# Use my Series of boolean values to return the values that meet my condition.

letters_series[letters_series.isin(vowels)]

### `.apply` 

Sometimes there are more complicated operations that we want to perform, and we need to apply a function to each element in a Series. In this case, we can define a function that handles a single value and use the `.apply` method to apply the function to each element in a Series.

In [None]:
def even_or_odd(n):
    '''
    A function that takes a number and returns a string indicating whether the passed number is even or odd.
    
    >>> even_or_odd(3)
    'odd'
    >>> even_or_odd(2)
    'even'
    '''
    if n % 2 == 0:
        return 'even'
    else:
        return 'odd'

numbers_series.apply(even_or_odd)

Here we define a function, `even_or_odd`, then reference that function when we call `.apply`. Notice that when we reference the `even_or_odd` function, we are **not** calling the function, rather, we are passing the `even_or_odd` function itself to the `.apply` method as an argument, which pandas will then call on every element of the Series.

It is also very common to see lambda functions used along with `.apply`. We could re-write the above example with a lambda function like so:

In [None]:
numbers_series.apply(lambda n: 'even' if n % 2 == 0 else 'odd')
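For a simple rule like this one, a vectorized expression can often produce the same result without calling a Python function on every element. The sketch below is one possible alternative using `np.where`, which this lesson does not otherwise cover; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([100, 43, 26, 17, 17])

# Element-by-element version, as in the lesson.
applied = numbers.apply(lambda n: 'even' if n % 2 == 0 else 'odd')

# Vectorized version: np.where picks 'even' where the condition holds, else 'odd'.
vectorized = pd.Series(
    np.where(numbers % 2 == 0, 'even', 'odd'),
    index=numbers.index,
)

print(applied.tolist())
print(vectorized.tolist())
```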

## Exercises Part II

Explore more attributes and methods while you continue to work with the fruits Series.

1. Capitalize all the string values in fruits.

2. Count the letter "a" in all the string values (use string vectorization).

3. Output the number of vowels in each and every string value.
    

4. Write the code to get the longest string value from fruits.

5. Write the code to get the string values with 5 or more letters in the name.

6. Use the `.apply` method with a lambda function to find the fruit(s) containing the letter `"o"` two or more times.

7. Write the code to get only the string values containing the substring "berry".

8. Write the code to get only the string values containing the substring "apple".

9. Which string value contains the most vowels?

___

## Binning Data

I can bin continuous data to convert it to categorical data. We will look at two different ways to accomplish binning below.

In [None]:
s = pd.Series(list(range(15)))
s

### `pd.cut(s, bins=n)`

We can either specify the number of bins to create, in which case pandas creates bins of equal width, or we can specify the bin edges ourselves by passing a list of cutoffs.

In [None]:
# Bin values into 3 equal-sized bins.

pd.cut(s, 3)

In [None]:
# Bin values into bins with the cutoffs I specify. The bins are no longer of equal size.

pd.cut(s, [-1, 3, 12, 16])

In [None]:
# How many values fall into each bin? I can chain on the value_counts method.

pd.cut(s, 3).value_counts()
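Since the goal of binning is categorical data, it can help to give each bin a readable name. `pd.cut` accepts a `labels` argument for this; the sketch below reuses the custom cutoffs from above, and the label names are hypothetical.

```python
import pandas as pd

s = pd.Series(list(range(15)))

# Same cutoffs as above, with a made-up label for each interval.
binned = pd.cut(s, bins=[-1, 3, 12, 16], labels=['low', 'mid', 'high'])

# The result is a categorical Series, so value_counts works as before.
print(binned.value_counts())
```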

### `value_counts(bins=n)`

The `value_counts` method can also be valuable here. It has a parameter named `bins`, which will allow us to quickly bin and group our data at the same time if that is our desired end goal.

In [None]:
s.value_counts(bins=3)

## Plotting Data

**The `.plot()` method** allows us to quickly visualize the data in a Series. It's built on top of Matplotlib!


- By default, `.plot()` draws a line plot of the values against the index.


- We can also customize our plot using the parameters of the `.plot` method or by using Matplotlib if we like. We will look at examples of both ways below.

Check the docs [here](https://pandas.pydata.org/pandas-docs/version/0.24.2/reference/api/pandas.Series.plot.html) for more on the `.plot()` method.

In [None]:
# With no arguments, .plot() draws a line plot; it may or may not tell the story we want.

nums_series = pd.Series([1, 5, 5, 5, 10, 20, 100, 40])
nums_series.plot()

plt.show()

We can also use specific types of visualizations like this:

In [None]:
# So, here we specify the type of plot we would like to use; a histogram is one option.
nums_series.plot.hist()

plt.show()

The `.value_counts` method returns a Series, so we can call the `.plot` method on the resulting Series; this is called method chaining.

In [None]:
# Construct the Series.
lets_series = pd.Series(['a', 'b', 'a', 'c', 'b', 'a', 'd', 'a'])
lets_series.value_counts()

In [None]:
# Plot the value_counts of our Series. Rotate our x-tick values.
lets_series.value_counts().plot.bar(rot=0)

plt.show()

Any additional keyword arguments passed to the pandas `.plot` method will be passed along to the corresponding Matplotlib functions. In addition, we can use Matplotlib the same way we have before to set titles, tweak axis labels, etc. Let's look at both ways.

In [None]:
# Use the parameters of the .plot method to customize my chart.

pd.Series(['a', 'b', 'a', 'c', 'b', 'a', 'd', 'a']).value_counts().plot.bar(title='Example Pandas Visualization', 
                                                                            rot=0, 
                                                                            color='firebrick', 
                                                                            ec='black',
                                                                            width=.9).set(xlabel='Letter',
                                                                                         ylabel='Frequency')

plt.show()

In [None]:
# Use Matplotlib to customize.

pd.Series(['a', 'b', 'a', 'c', 'b', 'a', 'd', 'a']).value_counts().plot.bar(color='firebrick', width=.9)
plt.title('Example Pandas Visualization')
plt.xticks(rotation=0)
plt.xlabel('Letter')
plt.ylabel('Frequency')

plt.show()

## Further Reading

- [pandas documentation: `Series`](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#series)

___

## Exercises Part III

Use pandas to create a Series named letters from the following string:

        'hnvidduckkqxwymbimkccexbkmqygkxoyndmcxnwqarhyffsjpsrabtjzsypmzadfavyrnndndvswreauxovncxtwzpwejilzjrmmbbgbyxvjtewqthafnbkqplarokkyydtubbmnexoypulzwfhqvckdpqtpoppzqrmcvhhpwgjwupgzhiofohawytlsiyecuproguy'

1. Which letter occurs the most frequently in the letters Series? 

2. Which letter occurs the least frequently?
    

3. How many vowels are in the Series?
    

4. How many consonants are in the Series?
    

5. Create a Series that has all of the same letters but uppercased.
    

6. Create a bar plot of the frequencies of the 6 most commonly occurring letters.

Use pandas to create a Series named numbers from the following list:

        ['$796,459.41', '$278.60', '$482,571.67', '$4,503,915.98', '$2,121,418.3', '$1,260,813.3', '$87,231.01', '$1,509,175.45', '$4,138,548.00', '$2,848,913.80', '$594,715.39', '$4,789,988.17', '$4,513,644.5', '$3,191,059.97', '$1,758,712.24', '$4,338,283.54', '$4,738,303.38', '$2,791,759.67', '$769,681.94', '$452,650.23']

7. What is the data type of the numbers Series?

8. How many elements are in the numbers Series?
    

9. Perform the necessary manipulations by accessing Series attributes and methods to convert the numbers Series to a numeric data type.
    

10. Run the code to discover the maximum value from the Series.

11. Run the code to discover the minimum value from the Series.

12. What is the range of the values in the Series?
    

13. Bin the data into 4 equally sized intervals or bins and output how many values fall into each bin.
    

14. Plot the binned data in a meaningful way. Be sure to include a title and axis labels.

Use pandas to create a Series named exam_scores from the following list:

        [60, 86, 75, 62, 93, 71, 60, 83, 95, 78, 65, 72, 69, 81, 96, 80, 85, 92, 82, 78]
        
15. How many elements are in the exam_scores Series?

16. Run the code to discover the minimum, the maximum, the mean, and the median scores for the exam_scores Series.
    

17. Plot the Series in a meaningful way and make sure your chart has a title and axis labels.

18. Write the code necessary to implement a curve for your exam_scores Series and save this as curved_grades. Add the necessary points to the highest grade to make it 100, and add the same number of points to every other score in the Series as well.
    

19. Use a method to convert each of the numeric values in the curved_grades Series into a categorical value of letter grades. For example, 86 should be a 'B' and 95 should be an 'A'. Save this as a Series named letter_grades.

20. Plot your new categorical letter_grades Series in a meaningful way and include a title and axis labels.

## More Practice

Revisit the exercises from [https://gist.github.com/ryanorsinger/f7d7c1dd6a328730c04f3dc5c5c69f3a](https://gist.github.com/ryanorsinger/f7d7c1dd6a328730c04f3dc5c5c69f3a). 

After you complete each set of Series exercises, use any extra time you have to pursue the challenge below. You can work on these in the same notebook or file as the Series exercises or create a new practice notebook you can work in a little every day to keep your python and pandas skills sharp by trying to solve problems in multiple ways. *These are not a part of the Series exercises grade, so don't worry if it takes you days or weeks to meet the challenge.*

**Challenge yourself to be able to...**

- solve each using vanilla python.
    
- solve each using list comprehensions.
    
- solve each by using a pandas Series for the data structure instead of lists and using vectorized operations instead of loops and list comprehensions.