# Series Attributes and Statistical Methods

Our main accomplishment up to this point has been selecting subsets of data. We have not changed the data or made many interesting calculations. Our selections have happened in two ways:

* Selection by label and integer location
* Selection by actual values (boolean selection)

Both of these approaches use Python's selection operator, the brackets `[]`.

## Calling methods on a Series/DataFrame
Selecting subsets of data does not usually require calling methods using dot notation. In this chapter we will call many methods that will perform actions on our DataFrame. We have actually already called some methods such as the `head`, `tail`, `isna`, and `set_index` methods. There are around 250 methods available to both DataFrames and Series.

### Use a subset of methods
It can be overwhelming to think about learning and memorizing this much functionality. The good news is that many of these methods are unnecessary and add no extra functionality. Furthermore, many methods are remnants from the early days of pandas and have few or no use cases, or have been **deprecated**. A deprecated method is discouraged from use and will likely be removed from the library in the future.

### Minimally sufficient pandas
I suggest using a subset of the pandas library that allows you to do as many tasks as possible. I focus on the subset of pandas that maximizes both performance and readability. Because there is so much functionality, power users of pandas can come up with very creative and complex code to accomplish different tasks. This is not necessarily a positive thing: when working with a group of other data scientists, it can confuse those who are not familiar with the syntax. One of my most popular blog posts, [Minimally Sufficient Pandas][3], goes into great detail on this.

## Begin with the Series
We will begin our exploration of attributes and methods with Series objects.

### View the API for a complete list of functionality
An **Application Programming Interface**, or **API**, is the set of functionality that a library exposes to its users. The API reference lists and describes all of it. The pandas API reference can be found [here][1]. This is a huge list, but as mentioned above, only a subset of this page is needed for the vast majority of tasks.

### The best of the pandas Series API
The pandas Series object is a single dimension of data and easier to work with than an entire DataFrame. We start with it and cover the most basic and important methods below. You may find it useful to navigate to the [Series API][2] section of the documentation so that you can have a full list of the functionality.

### City of Houston Employee Data
We will use a small public dataset containing City of Houston employee information on their position, race, gender, and salary. Notice that the column `hire_date` can be read in as a datetime.

[1]: http://pandas.pydata.org/pandas-docs/stable/reference/index.html
[2]: http://pandas.pydata.org/pandas-docs/stable/reference/series.html
[3]: https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428

In [1]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
emp.head(3)

Unnamed: 0,title,dept,salary,race,gender,hire_date
0,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Male,2015-02-03
1,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Male,1982-02-08
2,SENIOR POLICE OFFICER,Houston Police Department-HPD,66614.0,Black,Male,1984-11-26


### Select a single column as a Series
Let's select the `salary` column as a Series and use it to explore the Series API.

In [2]:
salary = emp['salary']
salary.head()

0    45279.0
1    63166.0
2    66614.0
3    71680.0
4    42390.0
Name: salary, dtype: float64

Let's verify that we do have a Series object.

In [3]:
type(salary)

pandas.core.series.Series

## Core Series Attributes
pandas Series have [many attributes][1], but only a few are important to know. The attributes to be aware of are:

* `index`
* `values`
* `size`
* `dtype`

The `index` and `values` were covered in a previous chapter. Only `size` and `dtype` are new. The `size` attribute returns the total number of values in the Series, missing values included. The `dtype` attribute returns the data type of the values. Remember that all values in the Series will share the same data type. Let's display these now.

[1]: http://pandas.pydata.org/pandas-docs/stable/reference/series.html#attributes

In [10]:
salary.size

1653

In [6]:
salary.dtype

dtype('float64')

### `len` function instead of `size` attribute
The built-in `len` function returns the same number as the `size` attribute. 

In [11]:
len(salary)

1653

Even though they both report the same number for a Series, I typically use the `len` function because it also returns the number of rows when used on a DataFrame. The `size` attribute of a DataFrame returns the total number of values instead, that is, the number of rows multiplied by the number of columns.

In [12]:
len(emp)

1653

In [13]:
emp.size

9918
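
The relationship between `len` and `size` can be sketched on a tiny DataFrame. The name `toy` and its values are made up for illustration and are not the employee data.

```python
import pandas as pd

# Hypothetical stand-in for emp: 3 rows and 2 columns
toy = pd.DataFrame({'salary': [45279.0, 63166.0, 66614.0],
                    'race': ['White', 'White', 'Black']})

n_rows = len(toy)        # counts rows only
n_values = toy.size      # rows multiplied by columns

# For a single column (a Series), len and size agree
n_rows_s = len(toy['salary'])
n_values_s = toy['salary'].size

print(n_rows, n_values)      # 3 6
print(n_rows_s, n_values_s)  # 3 3
```

The same arithmetic explains the employee output above: 1653 rows times 6 columns gives 9918 values.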

## Arithmetic operators
Series have the ability to work with all of the following common arithmetic operators:

* `+` - Addition
* `-` - Subtraction
* `*` - Multiplication
* `/` - Division
* `//` - Floor division
* `**` - Exponentiation
* `%` - Modulo (returns the remainder)

All of the arithmetic operators operate on every value in the Series. Let's see some examples.

Let's add 5 to every value in the Series.

In [14]:
result = salary + 5
result.head(3)

0    45284.0
1    63171.0
2    66619.0
Name: salary, dtype: float64

Raise each value in the Series to the .2 power.

In [15]:
result = salary ** .2
result.head(3)

0    8.534525
1    9.122139
2    9.219622
Name: salary, dtype: float64

Divide each value in the Series by 173. The single division sign performs **true division**, which keeps the decimal part of each result.

In [16]:
result = salary / 173
result.head(3)

0    261.728324
1    365.121387
2    385.052023
Name: salary, dtype: float64

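Two division signs are used for **floor division**. Each result is rounded down to the nearest whole number (toward negative infinity) rather than rounded to the nearest one. For positive values such as these salaries, this is the same as dropping the decimals.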

In [17]:
result = salary // 173
result.head(3)

0    261.0
1    365.0
2    385.0
Name: salary, dtype: float64
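
Flooring and dropping the decimals only differ for negative numbers, which the salary data does not contain. A minimal sketch with two made-up values (not from the employee data) shows the difference:

```python
import pandas as pd

# Made-up values chosen so that the sign matters
s = pd.Series([7.5, -7.5])

floored = s // 2   # floor of 3.75 and floor of -3.75

print(floored.tolist())  # [3.0, -4.0]
```

Dropping the decimals of -3.75 would give -3.0, but floor division moves down to -4.0.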

### Isn't this chapter about calling methods?
Although the above operations are not actual methods and do not use dot notation, they work similarly to methods. You can think of them as methods that take exactly one parameter, the other object being operated on.
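
pandas also provides method counterparts for these operators, such as `add`, `mul`, `floordiv`, and `pow`. The sketch below rebuilds a short Series from the first three salaries shown earlier (the name `s` is just for illustration) and checks that each operator and its method give the same result.

```python
import pandas as pd

s = pd.Series([45279.0, 63166.0, 66614.0])  # first three salaries shown above

# Each operator has a method counterpart that takes the other operand
same_add = (s + 5).equals(s.add(5))
same_mul = (s * 2).equals(s.mul(2))
same_floordiv = (s // 173).equals(s.floordiv(173))
same_pow = (s ** 2).equals(s.pow(2))

print(same_add, same_mul, same_floordiv, same_pow)  # True True True True
```

I still prefer the operators for readability. The methods are mainly useful when you need their extra parameters.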

### Arithmetic operations are vectorized
All the above arithmetic operations are **vectorized**. This means that each operation was applied to each value in the Series without explicitly writing a `for` loop. Python lists do not work like this and require an explicit `for` loop to operate on each value.
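
As a small illustration of the difference, the sketch below adds 5 to the same three values stored first in a plain list and then in a Series. The variable names are made up for this example.

```python
import pandas as pd

values = [45279.0, 63166.0, 66614.0]

# A list does not support element-wise arithmetic
try:
    values + 5
except TypeError as err:
    print('list + 5 fails:', err)

# A list needs an explicit loop (here, a list comprehension)
list_result = [v + 5 for v in values]

# A Series applies the operation to every value at once
series_result = pd.Series(values) + 5

print(list_result)             # [45284.0, 63171.0, 66619.0]
print(series_result.tolist())  # [45284.0, 63171.0, 66619.0]
```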


## Comparison Operations

The following six comparison operators work similarly to their arithmetic analogs from above:

* `<` - Less than
* `<=` - Less than or equal to
* `>` - Greater than
* `>=` - Greater than or equal to
* `==` - Equal to
* `!=` - Not equal to

In the boolean selection chapters, we used these vectorized comparison operations (without the terminology) to produce Series of booleans. Let's see a few examples below.

In [18]:
result = salary > 50000
result.head(3)

0    False
1     True
2     True
Name: salary, dtype: bool

In [19]:
result = salary != 71680
result.head(3)

0    True
1    True
2    True
Name: salary, dtype: bool

## Boolean and bitwise operators

Python has three boolean operators, the keywords `and`, `or`, and `not`. These operators cannot do vectorized boolean operations. They need a single `True` or `False` from each operand, and pandas treats the truth value of a whole Series as ambiguous and raises an error. Instead, pandas and numpy rely on the bitwise and, or, and not operators, respectively `&`, `|`, and `~`, to perform vectorized boolean operations. They were thoroughly covered in the preceding chapters.
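
As a brief recap, here is a sketch on a three-value stand-in Series (the names `s`, `high`, and `low` are made up for illustration). It combines two boolean Series with the bitwise operators and shows what happens with the `and` keyword.

```python
import pandas as pd

s = pd.Series([45279.0, 63166.0, 66614.0])  # stand-in for the salary Series

high = s > 50000   # [False, True, True]
low = s < 65000    # [True, True, False]

both = high & low      # element-wise and
either = high | low    # element-wise or
not_high = ~high       # element-wise not

print(both.tolist())      # [False, True, False]
print(either.tolist())    # [True, True, True]
print(not_high.tolist())  # [True, False, False]

# The keyword version asks for one truth value for the whole Series
try:
    high and low
except ValueError as err:
    print('ValueError:', err)
```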

## Statistical methods
We will now call *actual* methods that compute [basic descriptive statistics][1] on a numerical Series. We will do so explicitly with dot notation. It is useful to place these methods into two categories - those that **aggregate** and those that do not.

### Aggregation methods
A method that performs an aggregation returns a **single** value that summarizes the Series. Examples of methods that aggregate are:

* `sum`
* `min`
* `max`
* `mean`
* `median`
* `std` - standard deviation
* `var` - variance
* `count` - returns number of non-missing values
* `describe` - returns most of the above aggregations together in one Series
* `quantile` - returns the value at the given quantile of the distribution (passed as a number between 0 and 1)

### Non-aggregation methods
Any other method that does not return a single value is not an aggregation. Some examples of these methods are:

* `abs` - takes absolute value
* `round` - round to the nearest given decimal place
* `cummin` - cumulative minimum
* `cummax` - cumulative maximum
* `cumsum` - cumulative sum
* `rank` - rank values in a variety of different ways
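* `diff` - difference between each element and a previous one (the one directly before it by default)
* `pct_change` - relative change from the previous element to the current one, expressed as a fraction rather than multiplied by 100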

[1]: http://pandas.pydata.org/pandas-docs/stable/reference/series.html#computations-descriptive-stats

## Aggregation methods
Let's see a few examples of common aggregation methods. Let's begin by summing every value in the Series.

In [20]:
salary.sum()

87003304.0

Get the minimum value of a Series.

In [21]:
salary.min()

24960.0

Get the maximum value of a Series.

In [22]:
salary.max()

210588.0
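
The other aggregation methods in the list follow the same pattern. The sketch below uses a stand-in Series built from the first five salaries displayed earlier, so the numbers describe those five values only and not the full employee data.

```python
import pandas as pd

# First five salaries from the head() output earlier in the chapter
s = pd.Series([45279.0, 63166.0, 66614.0, 71680.0, 42390.0])

mean = s.mean()          # 57825.8
median = s.median()      # 63166.0
q25 = s.quantile(0.25)   # 45279.0, the 25th percentile
summary = s.describe()   # count, mean, std, min, 25%, 50%, 75%, max

print(mean, median, q25)
print(summary)
```

Note that `quantile` takes a number between 0 and 1, so the 25th percentile is written as `0.25`.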

### pandas ignores missing values by default
One big difference between pandas and numpy is that pandas ignores missing values by default. When calling aggregation methods such as `sum` or `mean`, pandas ignores any missing value as if that piece of data did not exist. numpy returns `nan` for its aggregation methods when one or more values are missing. Let's verify this by extracting the values of `salary` as a numpy array and then calling the array `sum` method.

In [23]:
salary.values.sum()

nan

We can make pandas Series behave like numpy by setting the `skipna` parameter to `False`. Most of the aggregation methods above, including `sum`, `min`, `max`, `mean`, `median`, `std`, and `var`, have the `skipna` parameter available.

In [24]:
salary.sum(skipna=False)

nan
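
The same behavior can be reproduced on a three-value Series with one missing value. This is a sketch with made-up data. The `np.nansum` function is numpy's own way of skipping missing values and is not used elsewhere in this chapter.

```python
import numpy as np
import pandas as pd

s = pd.Series([45279.0, np.nan, 66614.0])  # one missing value

pandas_sum = s.sum()               # skips the missing value
numpy_sum = s.values.sum()         # nan, numpy does not skip it
strict_sum = s.sum(skipna=False)   # nan, pandas told not to skip it
nan_aware = np.nansum(s.values)    # numpy function that skips nan

print(pandas_sum, numpy_sum, strict_sum, nan_aware)
# 111893.0 nan nan 111893.0
```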

### The `count` method

The `count` method returns the number of non-missing values. Since this number is less than `len(salary)`, we know missing values exist.

In [25]:
salary.count()

1551

In [27]:
len(salary)  # len also counts missing values

1653
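
Subtracting the two numbers gives the number of missing values. The sketch below shows this on a made-up Series and compares it with summing the result of the `isna` method, which was used in an earlier chapter.

```python
import numpy as np
import pandas as pd

s = pd.Series([45279.0, np.nan, 66614.0, np.nan])  # two missing values

n_missing = len(s) - s.count()   # 4 total values minus 2 non-missing
n_missing_alt = s.isna().sum()   # each True counts as 1

print(n_missing, n_missing_alt)  # 2 2
```

For the employee data above, 1653 - 1551 means that 102 salaries are missing.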

## Non-Aggregation methods
Taking the absolute value or rounding are two examples of methods that get applied to each individual value in the Series independently. They return a Series the same length as the original and thus are not performing an aggregation. The `abs` method takes the absolute value of each value in the Series. Since none of the values in our salary Series are negative, it remains the same.

In [28]:
salary.abs().head(3)

0    45279.0
1    63166.0
2    66614.0
Name: salary, dtype: float64

The `round` method rounds each value to the nearest given decimal place. Use the `decimals` parameter to determine the place of the rounding. Negative numbers may be used to round places to the left of the decimal as we do in the next example.

In [29]:
# round to the nearest thousand
salary.round(decimals=-3).head(3)

0    45000.0
1    63000.0
2    67000.0
Name: salary, dtype: float64
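
Two other non-aggregation methods from the list earlier, `diff` and `pct_change`, compare each value with the one before it. Here is a sketch using the first three salaries shown above as a stand-in Series:

```python
import pandas as pd

s = pd.Series([45279.0, 63166.0, 66614.0])  # first three salaries shown above

diffs = s.diff()        # current value minus the previous value
pct = s.pct_change()    # diffs divided by the previous value

print(diffs.tolist())   # [nan, 17887.0, 3448.0]
print(pct.round(4).tolist())
```

The first value has nothing before it, so both methods return a missing value there. The result still has the same length as the original.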

### Accumulation methods
There are a few accumulation methods that work by keeping track of previous data. For instance, the `cummin` method keeps track of the current minimum value in the Series. It begins at the top with the first value. Since it's the first, it will be the minimum. It then continues down the Series to the second value. If the second value is less than the first, then it will be the new minimum. If not, then the first value will remain as the minimum. It returns a Series the same length as the original of all the current minimums.

In [None]:
salary.cummin().head()
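
To trace the walk-through by hand, the sketch below rebuilds a Series from the first five salaries displayed earlier in this chapter and applies the accumulation methods to it. The stand-in name `s` is an assumption for illustration.

```python
import pandas as pd

# First five salaries from the head() output earlier in the chapter
s = pd.Series([45279.0, 63166.0, 66614.0, 71680.0, 42390.0])

running_min = s.cummin()   # smallest value seen so far
running_max = s.cummax()   # largest value seen so far
running_sum = s.cumsum()   # total of all values so far

print(running_min.tolist())  # [45279.0, 45279.0, 45279.0, 45279.0, 42390.0]
print(running_max.tolist())  # [45279.0, 63166.0, 66614.0, 71680.0, 71680.0]
print(running_sum.tolist())  # [45279.0, 108445.0, 175059.0, 246739.0, 289129.0]
```

The minimum stays at the first value until the fifth salary, which is the first one smaller than it.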

### Non-aggregation methods return an entirely new Series
The non-aggregation methods return an entirely new Series and do not modify the calling Series. This is a crucial concept to understand. pandas has only a few operations and methods that modify objects in-place. Nearly all of the time, a new object is returned. Let's verify that the calling object has not changed.

In [30]:
salary_round = salary.round(decimals=-3)
salary_round.head(3)

0    45000.0
1    63000.0
2    67000.0
Name: salary, dtype: float64

The `salary` Series is the calling object, i.e., the one whose method is called, and it remains unchanged.

In [31]:
salary.head(3)

0    45279.0
1    63166.0
2    66614.0
Name: salary, dtype: float64

## Operations on a boolean Series
One nice property of boolean Series is that their values evaluate to 0/1. `False` evaluates to 0 and `True` evaluates to 1. This makes for some nice shortcuts when answering some queries. Let's create a boolean Series and determine whether an employee is white or not.

In [32]:
race = emp['race'] 
filt = race == 'White'
filt.head(3)

0     True
1     True
2    False
Name: race, dtype: bool

If we are interested in the number of employees that are white, we could do boolean selection like this and then find the length of the result.

In [33]:
white_emp = race[filt]
white_emp.head(3)

0    White
1    White
4    White
Name: race, dtype: object

In [34]:
len(white_emp)

600

### Sum a boolean Series
Instead, we can just sum the boolean Series as each white employee will be `True` which is equivalent to 1. Therefore summing a boolean Series counts the number of values that are `True`.

In [35]:
filt.sum()

600

We can even shorten this to a single line of code.

In [36]:
(emp['race'] == 'White').sum()

600

### Explanation of this one line of code
Let's examine the line `(emp['race'] == 'White').sum()`. Python first evaluates the expression in parentheses - `emp['race'] == 'White'`. This results in a boolean Series, which has the same methods available as any other Series. We then call the `sum` method on this Series to get the desired result.
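
The equivalence of the two counting approaches can be checked on a small made-up Series (these five values are not the employee data). Because `True` counts as 1, taking the `mean` of the boolean Series also gives the share of values that are `True`.

```python
import pandas as pd

race = pd.Series(['White', 'White', 'Black', 'White', 'Hispanic'])  # made-up values

filt = race == 'White'

count_by_len = len(race[filt])          # boolean selection, then length
count_by_sum = filt.sum()               # each True counts as 1
one_liner = (race == 'White').sum()     # same thing without the variable
share = filt.mean()                     # fraction of True values

print(count_by_len, count_by_sum, one_liner, share)  # 3 3 3 0.6
```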

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Read in the movie dataset with the title as the index and assign the `imdb_score` as a Series to variable `score`. Output the first 5 values.</span>

In [3]:
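import pandas as pd
movie = pd.read_csv('../data/movie.csv', index_col=0)  # the first column (title) becomes the index
score = movie['imdb_score']
imdb = score  # the solutions below refer to this Series as `imdb`
imdb.head(5)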

title
Avatar                                        7.9
Pirates of the Caribbean: At World's End      7.1
Spectre                                       6.8
The Dark Knight Rises                         8.5
Star Wars: Episode VII - The Force Awakens    7.1
Name: imdb_score, dtype: float64

### Exercise 2
<span  style="color:green; font-size:16px">What is the data type of `score` and how many values does it contain?</span>

In [51]:
imdb.dtype

dtype('float64')
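
The exercise also asks how many values the Series contains, which `len` or the `size` attribute answers. A minimal sketch, assuming a stand-in `score` Series built from the five scores displayed in Exercise 1 rather than the full movie data:

```python
import pandas as pd

# Stand-in for the full Series: the five scores shown in Exercise 1
score = pd.Series([7.9, 7.1, 6.8, 8.5, 7.1], name='imdb_score')

data_type = score.dtype   # float64
n_values = len(score)     # same number as score.size for a Series

print(data_type, n_values)  # float64 5
```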

### Exercise 3
<span  style="color:green; font-size:16px">What is the maximum and minimum score?</span>

In [52]:
imdb.min()

1.6
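
The exercise asks for the maximum as well. A sketch on a stand-in `score` Series (the five scores displayed in Exercise 1, not the full data) shows the two calls, plus `agg`, which can return both values at once:

```python
import pandas as pd

# Stand-in for the full Series: the five scores shown in Exercise 1
score = pd.Series([7.9, 7.1, 6.8, 8.5, 7.1], name='imdb_score')

lowest = score.min()
highest = score.max()
both = score.agg(['min', 'max'])   # a Series indexed by 'min' and 'max'

print(lowest, highest)  # 6.8 8.5
print(both)
```

On the full movie data, the sorted output in Exercise 5 suggests that the maximum is 9.5.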

### Exercise 4
<span  style="color:green; font-size:16px">How many missing values are there in the `score` Series?</span>

In [60]:
len(imdb) - imdb.count() 

0

### Exercise 5
<span  style="color:green; font-size:16px">Read the docstrings on how the `rank` method works and then rank the first 10 values in `score` from highest to lowest.</span>

In [67]:
imdb.sort_values(ascending=False).head(10)

title
Towering Inferno                  9.5
The Shawshank Redemption          9.3
The Godfather                     9.2
Dekalog                           9.1
Kickboxer: Vengeance              9.1
The Dark Knight                   9.0
Fargo                             9.0
The Godfather: Part II            9.0
The Good, the Bad and the Ugly    8.9
12 Angry Men                      8.9
Name: imdb_score, dtype: float64
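
The solution above sorts the whole Series, whereas the exercise asks for the `rank` method on the first 10 values. A possible sketch, assuming a stand-in `score` Series built from the five scores displayed in Exercise 1 (the real Series would supply ten values to `head(10)`):

```python
import pandas as pd

# Stand-in for the full Series: the five scores shown in Exercise 1
score = pd.Series(
    [7.9, 7.1, 6.8, 8.5, 7.1],
    index=['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
           'The Dark Knight Rises', 'Star Wars: Episode VII - The Force Awakens'],
    name='imdb_score',
)

# ascending=False gives rank 1 to the highest score
ranks = score.head(10).rank(ascending=False)

print(ranks)
```

Unlike `sort_values`, `rank` keeps the original order and returns each value's position. With the default settings, the two tied scores of 7.1 share the average of ranks 3 and 4, which is 3.5.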

### Exercise 6
<span  style="color:green; font-size:16px">How many movies have scores greater than 6? (Remember that True/False evaluates to 1/0)</span>

In [6]:
(imdb > 6).sum()

3368

### Exercise 7
<span  style="color:green; font-size:16px">How many movies have scores greater than 4 and less than 7?</span>

In [18]:
((4 < imdb) & (imdb < 7)).sum()

3021

### Exercise 8
<span  style="color:green; font-size:16px">Find the difference between the median and mean of the scores.</span>

In [15]:
imdb.median() - imdb.mean()

0.1625711960943912

### Exercise 9
<span  style="color:green; font-size:16px">Add 1 to every value of `score` and then calculate the median.</span>

In [19]:
(imdb + 1).median()

7.6

### Exercise 10
<span  style="color:green; font-size:16px">Calculate the median of `score` and add 1 to this. Why is this value the same as Exercise 9?</span>

In [20]:
imdb.median()+1

7.6
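
As for why the two values match: adding the same constant to every score does not change their order, so the middle value is still the same movie's score, only shifted by that constant. A sketch on a stand-in `score` Series (the five scores displayed in Exercise 1) shows the two calculations side by side:

```python
import pandas as pd

# Stand-in for the full Series: the five scores shown in Exercise 1
score = pd.Series([7.9, 7.1, 6.8, 8.5, 7.1], name='imdb_score')

shift_then_median = (score + 1).median()   # shift every value, then take the middle
median_then_shift = score.median() + 1     # take the middle, then shift it

print(shift_then_median, median_then_shift)
```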

### Exercise 11
<span  style="color:green; font-size:16px">Return a Series that has only scores above the 99.9th percentile</span>

In [26]:
imdb[imdb > imdb.quantile(0.999)]

title
The Shawshank Redemption    9.3
Towering Inferno            9.5
Dekalog                     9.1
The Godfather               9.2
Kickboxer: Vengeance        9.1
Name: imdb_score, dtype: float64