# Basic operations on Series and DataFrames

In [2]:
import pandas as pd

In [3]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass

As you play around with DataFrames, you'll notice that many operations that work on NumPy arrays also work on DataFrames.


In [4]:
# redefining the example objects

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})

## Element-wise operations (like NumPy)

Just like with NumPy arrays, many operations are element-wise:

In [5]:
population / 100

Belgium           0.113
France            0.643
Germany           0.813
Netherlands       0.169
United Kingdom    0.649
dtype: float64

In [6]:
countries['population'] / countries['area']

0    0.000370
1    0.000096
2    0.000228
3    0.000407
4    0.000265
dtype: float64

In [7]:
np.log(countries['population'])

0    2.424803
1    4.163560
2    4.398146
3    2.827314
4    4.172848
Name: population, dtype: float64
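
As a small sketch of the same idea on the `population` Series: NumPy functions, scalar arithmetic and comparisons are all applied element by element, and the result keeps the index labels. The names `log_population`, `population_abs` and `is_large` are only illustrative, and the conversion assumes the population values are expressed in millions.

```python
import numpy as np
import pandas as pd

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

# NumPy ufuncs return a Series and keep the index labels.
log_population = np.log(population)

# Arithmetic with a scalar is applied to every element
# (assumption: the values are in millions of inhabitants).
population_abs = population * 1_000_000

# Comparisons are element-wise too and give a boolean Series.
is_large = population > 50
```

A boolean Series such as `is_large` can then be used for filtering, for example `population[is_large]`.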

### Alignment! (unlike numpy)

Pay attention to **alignment**, though: operations between Series align on the index labels, and a label that is present in only one of the operands produces `NaN`:

In [30]:
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]

In [31]:
s1

Belgium    11.3
France     64.3
dtype: float64

In [32]:
s2

France     64.3
Germany    81.3
dtype: float64

In [33]:
s1 + s2

Belgium      NaN
France     128.6
Germany      NaN
dtype: float64
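
If the `NaN` values are not what you want, there are a few ways around them. The sketch below shows two options, assuming that a missing label should either count as zero or be dropped; which one is appropriate depends on the data. The names `total` and `common` are only illustrative.

```python
import pandas as pd

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9})
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]

# Treat a label that is missing in one operand as 0 before adding.
total = s1.add(s2, fill_value=0)

# Or keep only the labels that occur in both Series.
common = (s1 + s2).dropna()
```

Here `total` has a value for all three countries, while `common` only contains France.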

## Aggregations (reductions)

Pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrames, Series, Index) and produce a single value. When applied to a DataFrame, the result is returned as a pandas Series (one value for each column).

The average population number:

In [34]:
population.mean()

47.739999999999995

The minimum area:

In [35]:
countries['area'].min()

30510

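For DataFrames, only the numeric columns can be aggregated this way. Recent pandas versions raise an error for the string columns unless `numeric_only=True` is passed (older versions dropped them silently):

In [36]:
countries.median(numeric_only=True)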

area          244820.0
population        64.3
dtype: float64
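
To make the "one value for each column" behaviour concrete, here is a minimal sketch that first selects the numeric columns explicitly and then aggregates them. Using `select_dtypes` is just one possible way to do the selection; the names `numeric`, `col_means` and `col_range` are illustrative.

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'population': [11.3, 64.3, 81.3, 16.9, 64.9],
    'area': [30510, 671308, 357050, 41526, 244820],
    'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})

# Keep only the numeric columns (population and area).
numeric = countries.select_dtypes(include='number')

# A single reduction gives a Series indexed by column name.
col_means = numeric.mean()

# Several reductions at once give a DataFrame with one row per function.
col_range = numeric.agg(['min', 'max'])
```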

<div class="alert alert-success">
    <b>EXERCISE</b>: Calculate the population numbers relative to Belgium
</div>

In [37]:
population / population['Belgium']

Belgium           1.000000
France            5.690265
Germany           7.194690
Netherlands       1.495575
United Kingdom    5.743363
dtype: float64

<div class="alert alert-success">
    <b>EXERCISE</b>: Calculate the population density for each country and add this as a new column to the dataframe.
</div>

In [38]:
countries['population']*1000000 / countries['area']

0    370.370370
1     95.783158
2    227.699202
3    406.973944
4    265.092721
dtype: float64

In [39]:
countries['density'] = countries['population']*1000000 / countries['area']
countries

| | area | capital | country | population | density |
|---|---|---|---|---|---|
| 0 | 30510 | Brussels | Belgium | 11.3 | 370.370370 |
| 1 | 671308 | Paris | France | 64.3 | 95.783158 |
| 2 | 357050 | Berlin | Germany | 81.3 | 227.699202 |
| 3 | 41526 | Amsterdam | Netherlands | 16.9 | 406.973944 |
| 4 | 244820 | London | United Kingdom | 64.9 | 265.092721 |
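
Assigning to `countries['density']` modifies the DataFrame in place. As an alternative sketch, `assign` returns a new DataFrame with the extra column and leaves the original untouched. The name `with_density` is illustrative, and the factor of one million again assumes the population is given in millions.

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'population': [11.3, 64.3, 81.3, 16.9, 64.9],
    'area': [30510, 671308, 357050, 41526, 244820],
    'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})

# Build a copy with an extra 'density' column (inhabitants per unit of area).
with_density = countries.assign(
    density=lambda df: df['population'] * 1_000_000 / df['area']
)
```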


## Some other useful methods

Sorting the rows of the DataFrame according to the values in a column:

In [40]:
countries.sort_values('density', ascending=False)

| | area | capital | country | population | density |
|---|---|---|---|---|---|
| 3 | 41526 | Amsterdam | Netherlands | 16.9 | 406.973944 |
| 0 | 30510 | Brussels | Belgium | 11.3 | 370.370370 |
| 4 | 244820 | London | United Kingdom | 64.9 | 265.092721 |
| 2 | 357050 | Berlin | Germany | 81.3 | 227.699202 |
| 1 | 671308 | Paris | France | 64.3 | 95.783158 |
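
Note that `sort_values` returns a new, sorted DataFrame and keeps the original row labels, which is why the index above reads 3, 0, 4, 2, 1. The sketch below illustrates this and also shows `nlargest` as a shortcut when only the top rows are needed; the names `by_density` and `top2` are illustrative.

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'population': [11.3, 64.3, 81.3, 16.9, 64.9],
    'area': [30510, 671308, 357050, 41526, 244820],
    'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})
countries['density'] = countries['population'] * 1_000_000 / countries['area']

# Sorting gives a new DataFrame; 'countries' itself keeps its row order.
by_density = countries.sort_values('density', ascending=False)

# Only the two most densely populated countries.
top2 = countries.nlargest(2, 'density')
```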


Another useful method is ``describe``, which computes summary statistics for each numeric column:

In [41]:
countries.describe()

| | area | population | density |
|---|---|---|---|
| count | 5.0 | 5.0 | 5.0 |
| mean | 269042.8 | 47.74 | 273.183879 |
| std | 264012.827994 | 31.519645 | 123.440607 |
| min | 30510.0 | 11.3 | 95.783158 |
| 25% | 41526.0 | 16.9 | 227.699202 |
| 50% | 244820.0 | 64.3 | 265.092721 |
| 75% | 357050.0 | 64.9 | 370.370370 |
| max | 671308.0 | 81.3 | 406.973944 |
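
The quartiles shown are only the defaults. As a brief sketch, the `percentiles` argument lets you ask for other quantiles, and `describe` also works on a single column. The chosen percentiles and the names `summary` and `area_stats` are arbitrary examples.

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'population': [11.3, 64.3, 81.3, 16.9, 64.9],
    'area': [30510, 671308, 357050, 41526, 244820],
    'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})

# Request the 10th and 90th percentiles; the median (50%) is always included.
summary = countries.describe(percentiles=[0.1, 0.9])

# Summary statistics for one column come back as a Series.
area_stats = countries['area'].describe()
```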


## Other features

* Working with missing data (`.dropna()`, `pd.isnull()`)
* Merging and joining (`concat`, `join`)
* Grouping: `groupby` functionality
* Reshaping (`stack`, `pivot`)
* Time series manipulation (resampling, time zones, ...)
* Easy plotting

There are many, many more interesting operations that can be done on Series and DataFrame objects, but rather than continue using this toy data, we'll instead move to a real-world example, and illustrate some of the advanced concepts along the way.

See the next notebooks!

## Acknowledgement

> *© 2015, Stijn Van Hoey and Joris Van den Bossche  (<mailto:stijnvanhoey@gmail.com>, <mailto:jorisvandenbossche@gmail.com>). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*

> This notebook is partly based on material of Jake Vanderplas (https://github.com/jakevdp/OsloWorkshop2014).

---