In [1]:
import pandas as pd

In [2]:
list_of_dataframes = []

for year in [2007, 2008, 2009]:
    url = f'https://raw.githubusercontent.com/SimonCarryer/pandas_tutorial/master/data/pa_{year}_irs.csv'
    temp = pd.read_csv(url, thousands=',', index_col=0, names=['n_returns', 'income'], skiprows=2)
    temp['year'] = year
    list_of_dataframes.append(temp)
income = pd.concat(list_of_dataframes)

## Operations on Data

As we've seen before, Pandas always applies operations on `Series` objects to the whole Series. A trivial example of this is adding a number to a `Series`.

In [3]:
(income['year'] + 100).head()

15001    2107
15003    2107
15004    2107
15005    2107
15006    2107
Name: year, dtype: int64

Many operators work on `Series` objects, such as `+`, `-`, `*`, and `/`. Logical operators (`|` and `&`) also work on Series containing boolean values. You can also apply the same operators to a whole `DataFrame`.

In [4]:
(income / 100).head()

Unnamed: 0,n_returns,income,year
15001,183.06,7328.75,20.07
15003,66.39,2058.84,20.07
15004,1.96,67.35,20.07
15005,55.46,2998.94,20.07
15006,2.27,80.72,20.07
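
The logical operators deserve a quick illustration of their own. The sketch below uses a tiny made-up `DataFrame` (the name `toy` and its values are assumptions for illustration, not part of the IRS data) to show `&` and `|` combining two boolean `Series`. If you write the comparisons inline, put each one in its own brackets, because `&` and `|` bind more tightly than `==` or `>` in Python.

```python
import pandas as pd

# Toy data standing in for a few rows of the income data (values are made up).
toy = pd.DataFrame(
    {'n_returns': [100, 2500, 40], 'year': [2007, 2008, 2007]},
    index=[15001, 15003, 15004],
)

# Each comparison returns a boolean Series with the same index as `toy`.
is_2007 = toy['year'] == 2007
is_large = toy['n_returns'] > 50

both = is_2007 & is_large      # True only where both conditions hold
either = is_2007 | is_large    # True where at least one condition holds

# A boolean Series can be used to select the matching rows.
print(toy[both])
```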


There's another way of operating on data in Pandas, which we've already seen. Instead of adding or subtracting (or whatever) a single number, you can use another `Series`.

In [5]:
(income['income'] / income['n_returns']).head()

15001    40.034688
15003    31.011297
15004    34.362245
15005    54.073927
15006    35.559471
dtype: float64

What happens in these cases is an _implicit join_, along the index of the `Series`. In other words, it lines up pairs of values according to their index, and then performs the operation on every pair of values. As with everything else in Pandas, you can do this any time indexes align. Let's add a random number to every year in our `income` DataFrame.

In [6]:
import numpy as np  # the pd.np shortcut is not available in recent pandas versions

randoms = pd.Series(np.random.randint(0, 100, size=len(income)), index=income.index)
(income['year'] + randoms).head()

15001    2054
15003    2045
15004    2079
15005    2078
15006    2102
dtype: int64
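
To see the implicit join more clearly, it helps to try it on two `Series` whose indexes don't match exactly. This is a minimal sketch with made-up values (the names `a`, `b` and `total` are just for illustration): values are paired by label rather than by position, and any label that appears in only one `Series` gets `NaN` in the result.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'w'])  # different order, one different label

# Values are matched on their index labels, not on their position.
total = a + b
print(total)

# 'y' and 'z' exist in both Series, so they are added together.
# 'x' and 'w' each exist in only one Series, so their result is NaN.
```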

It's worth noting that these operations are employing a little bit of shorthand. "Under the hood", using operators like this is calling equivalent methods on the object. The `+` operator is, behind the scenes, making a call to `Series.add()`, for example. In some cases, it will be useful to explicitly call the method yourself, as those methods can take additional arguments.

In [7]:
income['income'].divide(income['n_returns']).head()

15001    40.034688
15003    31.011297
15004    34.362245
15005    54.073927
15006    35.559471
dtype: float64

The above is exactly the same as using the `/` operator.

In [8]:
(income['income'] / income['n_returns']).head()

15001    40.034688
15003    31.011297
15004    34.362245
15005    54.073927
15006    35.559471
dtype: float64

My advice is - for the sake of brevity and clarity - use the operators wherever possible rather than the explicit methods. 
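
One case where the explicit method earns its keep is the `fill_value` argument, which tells Pandas what to use when a label is missing from one side of the operation. The sketch below uses two small made-up `Series` (all names and values here are illustrative assumptions) to compare the operator with the method.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'z'])

# With the operator, 'x' has no partner in `b`, so the result for 'x' is NaN.
with_operator = a + b

# With the method, the missing partner is treated as 0 instead.
with_method = a.add(b, fill_value=0)

print(with_operator)
print(with_method)
```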

#### A NOTE ON BRACKETS

When you perform an operation on a `Series`, it returns a new `Series` object (it doesn't change the data in the original `Series`). That means that you can chain operations together - like adding two `Series` together and then dividing by two. The tricky part is ensuring that your subsequent operations are performed on the result of the first operation, not on just one of the `Series`. Brackets let you specify the order of operations. Consider the difference between the following two operations:

In [9]:
(income['income']/income['n_returns']*2).head()

15001     80.069376
15003     62.022594
15004     68.724490
15005    108.147854
15006     71.118943
dtype: float64

In [10]:
(income['income']/(income['n_returns']*2)).head()

15001    20.017344
15003    15.505648
15004    17.181122
15005    27.036964
15006    17.779736
dtype: float64

You'll note that we've already been using brackets to specify the order of operations every time we use the `head` method in the previous examples.

This chaining together of operations does let you perform quite complex operations very concisely. Here, for example, we're making a `DataFrame` that's got the average income for each year in a separate column.

In [11]:
(income['income']/income['n_returns']).groupby([income.index, income['year']]).sum().unstack().head()

year,2007,2008,2009
15001,40.034688,44.862339,
15003,31.011297,35.51651,35.732487
15004,34.362245,,
15005,54.073927,57.256456,57.972025
15006,35.559471,,


But! Anyone else reading the above line (including you when you come back to it a few weeks from now) will struggle to understand what it's doing. It's also hard to check the results of the intervening steps to make sure it's working as expected. I'd suggest stepping out the operation a little more explicitly.

In [12]:
income['average_income'] = income['income']/income['n_returns']
grouped_by_zip = income.groupby([income.index, 'year'])['average_income'].sum()
yearly_income = grouped_by_zip.unstack()
yearly_income.head()

year,2007,2008,2009
15001,40.034688,44.862339,
15003,31.011297,35.51651,35.732487
15004,34.362245,,
15005,54.073927,57.256456,57.972025
15006,35.559471,,


### Problems

- The income column is in "thousands of dollars". Convert it to just "dollars".
- What zip codes had the greatest year-on-year growth in income between 2008 and 2009?

## Putting data in bins (where it belongs, frankly)

A common thing you'll want to do (especially for charting, which we'll cover later) is "bin" data, i.e. group it according to some numerical thresholds. Pandas provides a bunch of very handy functions for doing this, but the way they work is a bit confusing. First, let's look at the `cut` function.

In [13]:
pd.cut(income['average_income'], bins=10).head(10)

15001    (-0.356, 53.609]
15003    (-0.356, 53.609]
15004    (-0.356, 53.609]
15005    (53.609, 107.04]
15006    (-0.356, 53.609]
15007    (-0.356, 53.609]
15009    (-0.356, 53.609]
15010    (-0.356, 53.609]
15012    (-0.356, 53.609]
15014    (-0.356, 53.609]
Name: average_income, dtype: category
Categories (10, interval[float64]): [(-0.356, 53.609] < (53.609, 107.04] < (107.04, 160.471] < (160.471, 213.902] ... (320.764, 374.195] < (374.195, 427.626] < (427.626, 481.057] < (481.057, 534.488]]

What's going on here? The `cut` function takes a `Series` and a number of bins, and returns a `Series` with a categorical dtype, where each value is the interval (bin) that the original value falls into. It's most useful when you combine that with our old friend `value_counts`:

In [14]:
pd.cut(income['average_income'], bins=10).value_counts()

(-0.356, 53.609]      3859
(53.609, 107.04]       750
(107.04, 160.471]      114
(160.471, 213.902]      32
(213.902, 267.333]       8
(267.333, 320.764]       5
(320.764, 374.195]       4
(481.057, 534.488]       3
(374.195, 427.626]       3
(427.626, 481.057]       1
Name: average_income, dtype: int64

By default, `cut` makes bins of equal sizes, in terms of the values being "cut" - in other words, each bin covers the same range of income values. If you want bins which contain equal numbers of _rows_ instead, there's another function for that, `qcut` - short for "quantile cut".

In [15]:
pd.qcut(income['average_income'], q=10).value_counts()  # note that the argument is "q", not "bins"

(66.477, 534.488]    478
(52.872, 66.477]     478
(47.505, 52.872]     478
(43.872, 47.505]     478
(38.137, 40.898]     478
(35.643, 38.137]     478
(32.799, 35.643]     478
(28.861, 32.799]     478
(0.177, 28.861]      478
(40.898, 43.872]     477
Name: average_income, dtype: int64
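
The difference between the two functions is easiest to see on a small, lopsided set of numbers. This is a minimal sketch using made-up values (the names `values`, `equal_width` and `equal_count` are illustrative): one large outlier stretches the range, so `cut` puts almost everything in the first bin, while `qcut` moves the boundary so that each bin holds the same number of rows.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

equal_width = pd.cut(values, bins=2)   # two bins spanning equal ranges of values
equal_count = pd.qcut(values, q=2)     # two bins holding equal numbers of rows

# sort_index() lists the bins from lowest to highest rather than by count.
print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```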

Because `cut` and `qcut` return a `Series`, and that `Series` has an index which aligns with the "income" `DataFrame`, we can use them to group the `DataFrame`. This is particularly useful with `qcut`.

In [16]:
income.groupby(pd.qcut(income['average_income'], q=3))['average_income'].mean()

average_income
(0.177, 36.458]      30.441317
(36.458, 46.126]     41.013184
(46.126, 534.488]    70.591524
Name: average_income, dtype: float64

There are a few other handy things you can do with binning data. You can supply your own bins - but be aware that if a row falls outside of your supplied bins, it will receive a `NaN` category, which will be excluded from any grouping.

In [17]:
manual_bins = [0, 50, 100, 200, 2000]
pd.cut(income['average_income'], bins=manual_bins).value_counts()

(0, 50]        3584
(50, 100]      1013
(100, 200]      153
(200, 2000]      29
Name: average_income, dtype: int64
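
To check how many rows fell outside your bins, you can look for the `NaN` values directly. The sketch below uses a handful of made-up numbers (not the IRS data) with the same bin edges as above; the value 5000 is larger than the last edge, so it ends up with no category.

```python
import pandas as pd

values = pd.Series([5, 75, 150, 5000])
binned = pd.cut(values, bins=[0, 50, 100, 200, 2000])

print(binned)

# isna() marks the rows that did not fit into any of the supplied bins.
n_outside = binned.isna().sum()
print(n_outside)

# value_counts() skips NaN by default; dropna=False includes it in the tally.
print(binned.value_counts(dropna=False))
```

Note that the bins are closed on the right by default, so a value exactly equal to the lowest edge (0 here) would also fall outside them unless you pass `include_lowest=True`.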

You can also supply your own labels, which can make things much more readable.

In [18]:
manual_labels = ['low', 'medium', 'high']
income.groupby(pd.qcut(income['average_income'], q=3, labels=manual_labels))['average_income'].mean()

average_income
low       30.441317
medium    41.013184
high      70.591524
Name: average_income, dtype: float64

### Problems

- Group the income data into 10 equal-sized bins, based on the number of returns filed.
- How many returns were filed in zip codes within the lowest 25% of average income?

## Applying custom functions

You can achieve a heck of a lot by chaining together various operations on `Series` objects. In general, it's better to use these functions where you can. They're heavily optimised, so you're likely to get much better performance. However, every so often you come across a case where what you need to do isn't supported by any simple operation, and then you might need to employ the `apply` method.

`apply`, simply, takes a function you specify, and executes it for every item in your `Series` or `DataFrame`. To demonstrate this, we're going to go back to our Dog Registrations dataset.

In [19]:
dogs = pd.read_csv('https://raw.githubusercontent.com/SimonCarryer/pandas_tutorial/master/data/dog_registrations.csv')

In [20]:
dogs.head()

Unnamed: 0,LicenseType,Breed,Color,DogName,OwnerZip,ExpYear,ValidDate
0,Dog Individual Female,AM PIT BULL TERRIER,SPOTTED,BUTTER,15001,2007,5/1/2007 15:15
1,Dog Individual Female,AM PIT BULL TERRIER,BROWN,SABLE,15001,2007,5/1/2007 15:15
2,Dog Individual Neutered Male,MIXED,.,YIP,15001,2007,4/11/2007 15:14
3,Dog Individual Male,DOBERMAN PINSCHER,RED,SABER,15003,2007,4/5/2007 15:00
4,Dog Individual Spayed Female,MIXED,BLACK,DAISY,15003,2007,5/25/2007 12:15


In [21]:
def sex(licensetype):
    return 'Male' if 'Male' in licensetype else 'Female'

dogs['LicenseType'].apply(sex).head()

0    Female
1    Female
2      Male
3      Male
4    Female
Name: LicenseType, dtype: object
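
For comparison, here is a sketch of how the same result could be reached without `apply`, using the string methods available under `.str` together with NumPy's `where`. It runs on a few made-up licence strings rather than the full dataset, and it is one possible approach under those assumptions, not the only way to do it.

```python
import numpy as np
import pandas as pd

# A few sample values in the same format as the LicenseType column.
license_types = pd.Series([
    'Dog Individual Female',
    'Dog Individual Neutered Male',
    'Dog Individual Spayed Female',
])

# str.contains is case-sensitive by default, so 'Female' does not match 'Male'.
is_male = license_types.str.contains('Male')

# np.where picks 'Male' where the condition is True and 'Female' elsewhere.
sex_vectorised = pd.Series(np.where(is_male, 'Male', 'Female'), index=license_types.index)
print(sex_vectorised)
```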

`apply` can also work row-wise on a DataFrame. These operations can get very slow, so think carefully about whether you can achieve the same result by operating over some `Series` objects instead.

In [22]:
def dog_description(row):
    if row['Color'][0] in 'AEIOU':
        article = 'An'
    else:
        article = 'A'
    if row['Breed'] == 'MIXED':
        breed = 'mutt'
    else:
        breed = row['Breed'].title()
    if 'Male' in row['LicenseType']:
        sex = 'male' 
    else:
        sex = 'female'
    name = row['DogName'].title()
    return f"{article} {row['Color'].lower()} {sex} {breed} called {name}"

dogs.iloc[0:20].apply(dog_description, axis=1)

0     A spotted female Am Pit Bull Terrier called Bu...
1       A brown female Am Pit Bull Terrier called Sable
2                              A . male mutt called Yip
3             A red male Doberman Pinscher called Saber
4                      A black female mutt called Daisy
5                    A spotted male mutt called Scooter
6               A multi female Rat Terrier called Tinky
7        A black/brown female Ger Shepherd called Amica
8                  A tan female Pomeranian called Taffy
9                  A spotted female Beagle called Belle
10                 A spotted female Beagle called Belle
11            A white female Am Eskimo Dog called Sasha
12         A black/brown male Collie Mix called Grizzly
13      A black/brown male Ger Shepherd called Hercules
14          A white/black male Sib Husky called Daytona
15                    A brown female Boxer called Chloe
16                   A white/brown male mutt called Max
17        A white/black female Aus Shepherd call

### Problems

* Trim the word "Dog" from the start of the `LicenseType` column.
* Use the `apply` function to double the value of `average_income` in the `income` DataFrame, but only if the `year` is 2007.
* EXTRA FOR EXPERTS: Do the above problem again, but this time without using the `apply` function.