# More Series Methods

## Overview

In the previous chapter, we covered the most essential and common attributes along with the statistical methods for pandas Series objects. In this chapter, we cover several other useful and common methods from the [Series API][1].

Let's begin by reading in the movie dataset and selecting the `duration` series, which contains the length of each movie in minutes.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#series

In [1]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv', index_col='title')
duration = movie['duration']
duration.head()

title
Avatar                                        178.0
Pirates of the Caribbean: At World's End      169.0
Spectre                                       148.0
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens      NaN
Name: duration, dtype: float64

## Methods for handling missing values
pandas provides the following methods to handle missing values:

* `isna` - Returns a Series of booleans based on whether each value is missing or not
* `notna` - Exact opposite of `isna`
* `fillna` - Fills missing values in a variety of ways
* `dropna` - Drops the missing values from the Series
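
As a quick illustration of the four methods listed above, here is a minimal sketch on a small made-up Series. The values, index labels, and the name `toy` are assumptions for illustration and are not taken from the movie dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the movie dataset
s = pd.Series([178.0, np.nan, 148.0, np.nan],
              index=['a', 'b', 'c', 'd'], name='toy')

missing = s.isna()     # True where a value is missing
present = s.notna()    # True where a value is present
filled = s.fillna(0)   # same length, missing values replaced by 0
dropped = s.dropna()   # shorter Series with the missing values removed

print(missing.tolist())  # [False, True, False, True]
print(present.tolist())  # [True, False, True, False]
print(filled.tolist())   # [178.0, 0.0, 148.0, 0.0]
print(dropped.tolist())  # [178.0, 148.0]
```

None of these methods modifies `s`; each one returns a new Series.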

### Counting the number of missing values
pandas doesn't have a single method that counts the number of missing values, but you can find it in two ways:

* Use the `count` method to find the number of non-missing values and subtract this from the total number of values
* Use the `isna` method to return a Series of booleans and chain the `sum` method

In [2]:
len(duration) - duration.count()

15

In [3]:
duration.isna().sum()

15

### Finding the percentage of missing values
To find the fraction of missing values in a Series, chain the `mean` method to the `isna` method. Multiply the result by 100 to express it as a percentage. Here, about 0.3% of the values are missing.

In [4]:
duration.isna().mean()

0.0030512611879576893

### Alternate calculation
The last calculation might be confusing. We could have been more explicit and calculated the fraction of missing values by dividing the number of missing values by the total size of the Series, as done below.

In [5]:
total = len(duration)
num_missing = total - duration.count()
num_missing / total

0.0030512611879576893

### Why does taking the mean of the boolean Series work?
The mean is defined as the sum of all values divided by the total number of values. pandas treats `True` as 1 and `False` as 0, so the sum of a boolean Series is just the number of `True` values, which in this example equals the number of missing values. Dividing that count by the total number of values gives the fraction that are missing.
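
The reasoning above can be checked by hand on a tiny Series. This is a minimal sketch with made-up values, where two of the five entries are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values: 2 of the 5 entries are missing
s = pd.Series([90.0, np.nan, 120.0, np.nan, 105.0])
is_missing = s.isna()

# True counts as 1 and False as 0, so the sum is the number of missing values
num_missing = is_missing.sum()
# The mean is that sum divided by the number of values: 2 / 5
fraction_missing = is_missing.mean()

print(num_missing)               # 2
print(fraction_missing)          # 0.4
print(num_missing / len(s))      # 0.4
```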

## Filling missing values
Many techniques exist for filling missing values, and some are quite complex and involve machine learning. pandas provides only a few simple options through the `fillna` method. In this section, we cover how to fill missing values with a constant. Let's output the `duration` Series again and see that the fifth value is missing.

In [6]:
duration.head()

title
Avatar                                        178.0
Pirates of the Caribbean: At World's End      169.0
Spectre                                       148.0
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens      NaN
Name: duration, dtype: float64

Pass the `fillna` method a scalar value to fill all the missing values with that number.

In [7]:
duration.fillna(999).head()

title
Avatar                                        178.0
Pirates of the Caribbean: At World's End      169.0
Spectre                                       148.0
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens    999.0
Name: duration, dtype: float64

A common strategy is to use the mean or median as the missing value replacement. Let's find the median of the Series.

In [10]:
median = duration.median()
median

103.0

Now, we can fill in all of the missing values with the median.

In [9]:
duration.fillna(median).head()

title
Avatar                                        178.0
Pirates of the Caribbean: At World's End      169.0
Spectre                                       148.0
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens    103.0
Name: duration, dtype: float64
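
To see the difference between filling with the median and filling with the mean, here is a minimal sketch on a small made-up Series. The values are assumptions chosen so that the two statistics differ; they do not come from the movie dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values with two missing entries
s = pd.Series([100.0, np.nan, 90.0, 200.0, np.nan])

# median and mean skip missing values by default
median = s.median()   # median of 90, 100, 200
mean = s.mean()       # (100 + 90 + 200) / 3

filled_median = s.fillna(median)
filled_mean = s.fillna(mean)

print(filled_median.tolist())  # [100.0, 100.0, 90.0, 200.0, 100.0]
print(filled_mean.tolist())    # [100.0, 130.0, 90.0, 200.0, 130.0]

# fillna returns a new Series; the original still has its missing values
print(s.isna().sum())          # 2
```

The median is less sensitive to extreme values such as the 200 above, which is one reason it is often chosen over the mean.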

## Dropping missing values
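The `dropna` method removes the missing values from the Series and returns a new Series. Notice that the fifth movie from the earlier output, Star Wars: Episode VII - The Force Awakens, no longer appears because its duration was missing.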

In [11]:
duration2 = duration.dropna()
duration2.head()

title
Avatar                                      178.0
Pirates of the Caribbean: At World's End    169.0
Spectre                                     148.0
The Dark Knight Rises                       164.0
John Carter                                 132.0
Name: duration, dtype: float64

Above, we calculated that there were 15 missing values in the Series. Let's verify that the length of the new Series has decreased by this amount.

In [12]:
len(duration2)

4901

In [13]:
len(duration)

4916
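
The same check can be written as a single comparison. This is a minimal sketch on a small made-up Series, where the values and index labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values with two missing entries
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0], index=list('abcde'))
dropped = s.dropna()

# The drop in length equals the number of missing values
print(len(s) - len(dropped) == s.isna().sum())  # True

# The remaining values keep their original index labels
print(dropped.index.tolist())                   # ['a', 'c', 'e']
```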

## Sorting
The `sort_values` method sorts the Series from least to greatest by default. It places missing values at the end.

In [15]:
duration.sort_values().head(3)

title
The Touch           7.0
Shaun the Sheep     7.0
Robot Chicken      11.0
Name: duration, dtype: float64

Set the `ascending` parameter to `False` to sort from greatest to least.

In [14]:
duration.sort_values(ascending=False).head(3)

title
Trapped                511.0
Carlos                 334.0
Blood In, Blood Out    330.0
Name: duration, dtype: float64
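
Missing values stay at the end regardless of the sort direction. The sketch below uses a small made-up Series to show this, along with the `na_position` parameter of `sort_values`, which can move them to the front. The values and index labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; the entry labeled 'x' is missing
s = pd.Series([120.0, np.nan, 7.0, 95.0], index=['w', 'x', 'y', 'z'])

ascending = s.sort_values()
descending = s.sort_values(ascending=False)
na_first = s.sort_values(na_position='first')

print(ascending.index.tolist())   # ['y', 'z', 'w', 'x']
print(descending.index.tolist())  # ['w', 'z', 'y', 'x']
print(na_first.index.tolist())    # ['x', 'y', 'z', 'w']
```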

### Sorting the index
Since Series also have an index, pandas allows you to sort by it as well with the `sort_index` method.

In [16]:
duration.sort_index().head(3)

title
#Horror                  101.0
10 Cloverfield Lane      104.0
10 Days in a Madhouse    111.0
Name: duration, dtype: float64

In [17]:
duration.sort_index(ascending=False).head(3)

title
Æon Flux                    93.0
xXx: State of the Union    101.0
xXx                        132.0
Name: duration, dtype: float64
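
The order of the titles above may look surprising. String labels are compared character by character using their Unicode code points rather than dictionary order, so `#` and digits come before letters, and `Æ` comes after `x`. The sketch below rebuilds a four-movie Series from the titles and durations shown in the outputs above to illustrate this.

```python
import pandas as pd

# Titles and durations copied from the outputs above
s = pd.Series([93.0, 132.0, 101.0, 104.0],
              index=['Æon Flux', 'xXx', '#Horror', '10 Cloverfield Lane'])

sorted_titles = s.sort_index().index.tolist()
print(sorted_titles)
# ['#Horror', '10 Cloverfield Lane', 'xXx', 'Æon Flux']

# Code points of the first character of each title, in sorted order
first_char_codes = [ord(title[0]) for title in sorted_titles]
print(first_char_codes)  # [35, 49, 120, 198]
```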

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">What percentage of actor 1 Facebook likes are missing?</span>

In [20]:
actor1 = movie['actor1_fb']
actor1.isna().sum()/len(actor1)

0.0014239218877135883

### Exercise 2
<span  style="color:green; font-size:16px">Use the `notna` method to find the number of non-missing values in the actor 1 Facebook like column. Verify this number is the same as the `count` method.</span>

In [22]:
actor1.notna().sum()  # same value as actor1.count()

4909

### Exercise 3
<span  style="color:green; font-size:16px">Use one line of code to fill the missing values of `actor1_fb` with the maximum of `actor2_fb`. Save this result to variable `actor1_fb_full`</span>

In [43]:
actor1_fb_full = actor1.fillna(movie['actor2_fb'].max())
actor1_fb_full

title
Avatar                                          1000.0
Pirates of the Caribbean: At World's End       40000.0
Spectre                                        11000.0
The Dark Knight Rises                          27000.0
Star Wars: Episode VII - The Force Awakens       131.0
John Carter                                      640.0
Spider-Man 3                                   24000.0
Tangled                                          799.0
Avengers: Age of Ultron                        26000.0
Harry Potter and the Half-Blood Prince         25000.0
Batman v Superman: Dawn of Justice             15000.0
Superman Returns                               18000.0
Quantum of Solace                                451.0
Pirates of the Caribbean: Dead Man's Chest     40000.0
The Lone Ranger                                40000.0
Man of Steel                                   15000.0
The Chronicles of Narnia: Prince Caspian       22000.0
The Avengers                                   26000.0
Pira

### Exercise 4
<span  style="color:green; font-size:16px">Verify the results of problem 3 by selecting just the values of `actor1_fb_full` that were filled with the maximum of `actor2_fb`.</span>

In [44]:
actor1_fb_full[movie['actor1_fb'].isna()]

title
Pink Ribbons, Inc.         137000.0
Sex with Strangers         137000.0
The Harvest/La Cosecha     137000.0
Ayurveda: Art of Being     137000.0
The Brain That Sings       137000.0
The Blood of My Brother    137000.0
Counting                   137000.0
Name: actor1_fb, dtype: float64