# Pandas
Pandas has two main data structures:
* Series: a one-dimensional labeled array
* DataFrame: a two-dimensional table, similar to an Excel sheet
## Pandas - Series

In [1]:
import pandas as pd
import numpy as np

We'll start by analyzing 'The Group of Seven', a political group formed by Canada, France, Germany, Italy, Japan, the United Kingdom and the US. We'll
begin with population, and for that we'll use a pandas.Series object.

In [75]:
# In millions
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

We can also set a name for the Series, which documents what the values represent.


In [21]:
g7_pop.name ='G7 Population in millions'

In [22]:
g7_pop

Canada      35.467
France      63.951
Germany     80.940
Italy       60.665
Japan      127.061
UK          64.511
US         318.523
Name: G7 Population in millions, dtype: float64

Series are pretty similar to NumPy arrays:

In [8]:
g7_pop.dtype, g7_pop.values

(dtype('float64'),
 array([ 35.467,  63.951,  80.94 ,  60.665, 127.061,  64.511, 318.523]))

In [None]:
type(g7_pop.values)
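
As a minimal sketch of the same check, the snippet below rebuilds the population figures in a fresh Series (the names pop and values are ours, not from the notebook) and confirms that the underlying data is a plain NumPy array. It uses to_numpy(), which returns the same kind of array as the values attribute.

```python
import numpy as np
import pandas as pd

# Same figures as g7_pop (in millions); "pop" is an illustrative name
pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

values = pop.to_numpy()            # the data as a NumPy array
print(type(values))                # <class 'numpy.ndarray'>
print(values.dtype == pop.dtype)   # both are float64
```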

Series look like simple Python lists or NumPy arrays, but they actually behave more like dicts.

A Series has an index, which is similar to the automatic index assigned to a Python list:

In [9]:
g7_pop[0], g7_pop[1], g7_pop.index

(35.467, 63.951, RangeIndex(start=0, stop=7, step=1))

In [10]:
l = ['a', 'b', 'c']

But, in contrast to lists, we can explicitly define the index:

In [17]:
g7_pop.index = ['Canada', 'France', 'Germany', 'Italy', 'Japan', 'UK', 'US']
g7_pop

Canada      35.467
France      63.951
Germany     80.940
Italy       60.665
Japan      127.061
UK          64.511
US         318.523
Name: G7 Population in millions, dtype: float64
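
Once the index holds labels, the dict comparison becomes concrete. The sketch below uses a shortened three-country Series (the names pop and as_dict are ours) to show a few dict-style operations that a Series supports.

```python
import pandas as pd

# Shortened, illustrative version of the population Series
pop = pd.Series(
    [35.467, 63.951, 80.940],
    index=['Canada', 'France', 'Germany'],
)

print('France' in pop)        # membership is tested against the index labels
print(list(pop.keys()))       # keys() returns the index

for country, value in pop.items():   # items() yields (label, value) pairs
    print(country, value)

as_dict = pop.to_dict()       # convert to a regular Python dict
```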

We can actually create Series out of dictionaries:

In [18]:
pd.Series({'Canada': 35.467,
           'France': 63.951,
           'Germany': 80.94,
           'Italy': 60.665,
           'Japan': 127.061,
           'UK': 64.511,
           'US': 318.523}, name='G7 Population in millions')

Canada      35.467
France      63.951
Germany     80.940
Italy       60.665
Japan      127.061
UK          64.511
US         318.523
Name: G7 Population in millions, dtype: float64

In [19]:
pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
          index=['Canada', 'France', 'Germany', 'Italy', 'Japan', 'UK', 'US'],
          name='G7 Population in millions')

Canada      35.467
France      63.951
Germany     80.940
Italy       60.665
Japan      127.061
UK          64.511
US         318.523
Name: G7 Population in millions, dtype: float64

In [24]:
pd.Series(g7_pop, index = ['France', 'Germany', 'Italy', 'Spain'])

France     63.951
Germany    80.940
Italy      60.665
Spain         NaN
Name: G7 Population in millions, dtype: float64
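
Building a Series from another Series plus an index list selects the matching labels and fills labels that have no data (Spain here) with NaN. The sketch below, with illustrative names pop, wanted, subset and same, suggests that this behaves like calling reindex on the original Series.

```python
import pandas as pd

pop = pd.Series({'France': 63.951, 'Germany': 80.940, 'Italy': 60.665})
wanted = ['France', 'Germany', 'Italy', 'Spain']

subset = pd.Series(pop, index=wanted)   # 'Spain' is not in pop, so it becomes NaN
same = pop.reindex(wanted)              # explicit way to ask for the same alignment

print(subset.equals(same))   # equals() treats NaNs in the same position as equal
print(subset.dropna())       # drop the missing entry again
```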

### Indexing

Indexing works much like a dictionary: you use the index label to look up the element you want.

In [40]:
g7_pop

Canada      35.467
France      63.951
Germany     80.940
Italy       60.665
Japan      127.061
UK          64.511
US         318.523
Name: G7 Population in millions, dtype: float64

In [26]:
g7_pop['Germany'], g7_pop['UK']

(80.94, 64.511)

Numeric positions can also be used, through the iloc attribute:

In [38]:
g7_pop.iloc[0], g7_pop.iloc[1], g7_pop.iloc[-1]

(35.467, 63.951, 318.523)
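
To keep the two lookup styles apart, a small sketch may help: loc selects by label and iloc selects by position. The Series pop and the result names below are illustrative, not part of the notebook.

```python
import pandas as pd

pop = pd.Series([35.467, 63.951, 318.523], index=['Canada', 'France', 'US'])

by_label = pop.loc['France']    # label-based lookup
by_position = pop.iloc[1]       # position-based lookup (second element)
last = pop.iloc[-1]             # negative positions count from the end

print(by_label, by_position, last)
```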

Selecting multiple elements at once:

In [39]:
# Pass a list of labels (note the double brackets [[]]); the result is another Series.
# Use iloc for a list of positions.
g7_pop[['Germany', 'UK', 'US']], g7_pop.iloc[[0, 1, 4, 5]]

(Germany     80.940
 UK          64.511
 US         318.523
 Name: G7 Population in millions, dtype: float64,
 Canada     35.467
 France     63.951
 Japan     127.061
 UK         64.511
 Name: G7 Population in millions, dtype: float64)

Slicing also works, but **important**: when slicing by label, pandas includes the upper limit in the result.

In [41]:
g7_pop['Canada': 'US']

Canada      35.467
France      63.951
Germany     80.940
Italy       60.665
Japan      127.061
UK          64.511
US         318.523
Name: G7 Population in millions, dtype: float64
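
The difference from regular Python slicing is easiest to see side by side. In this sketch (pop and the result names are ours), a label slice keeps its end label, while a positional slice through iloc excludes its end position, as Python lists do.

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665],
    index=['Canada', 'France', 'Germany', 'Italy'],
)

by_label = pop.loc['Canada':'Germany']   # end label IS included: 3 elements
by_position = pop.iloc[0:2]              # end position is NOT included: 2 elements

print(len(by_label), len(by_position))
```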

### Conditional selection (boolean arrays)
Boolean selection works the same way as with NumPy arrays:

In [43]:
g7_pop

Canada      35.467
France      63.951
Germany     80.940
Italy       60.665
Japan      127.061
UK          64.511
US         318.523
Name: G7 Population in millions, dtype: float64

In [62]:
g7_pop > 70, g7_pop[g7_pop>70], g7_pop.mean(), g7_pop[g7_pop>g7_pop.mean()]

(Canada     False
 France     False
 Germany     True
 Italy      False
 Japan       True
 UK         False
 US          True
 Name: G7 Population in millions, dtype: bool,
 Germany     80.940
 Japan      127.061
 US         318.523
 Name: G7 Population in millions, dtype: float64,
 107.30257142857144,
 Japan    127.061
 US       318.523
 Name: G7 Population in millions, dtype: float64)

Boolean conditions can be combined with these operators:

* ~ : not
* | : or
* & : and

In [64]:
g7_pop.std(), g7_pop[(g7_pop> g7_pop.mean() - g7_pop.std()) | (g7_pop > g7_pop.mean() + g7_pop.std())]

(97.24996987121581,
 Canada      35.467
 France      63.951
 Germany     80.940
 Italy       60.665
 Japan      127.061
 UK          64.511
 US         318.523
 Name: G7 Population in millions, dtype: float64)
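
As a further sketch of combining conditions, the snippet below (with our own names pop, in_band, within and outside) selects the countries within one standard deviation of the mean using &, and the remaining ones using ~. Each comparison needs its own parentheses because & and | bind more tightly than > and <.

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan', 'UK', 'US'],
)

mean, std = pop.mean(), pop.std()

# True where the value lies strictly between mean - std and mean + std
in_band = (pop > mean - std) & (pop < mean + std)

within = pop[in_band]     # countries inside the band
outside = pop[~in_band]   # everything else (the negated mask)

print(within)
print(outside)
```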

### Operations and methods
Series also support vectorized operations and aggregation functions, just like NumPy arrays:

In [57]:
g7_pop

Canada      35.467
France      63.951
Germany     80.940
Italy       60.665
Japan      127.061
UK          64.511
US         318.523
Name: G7 Population in millions, dtype: float64

In [59]:
g7_pop * 1_000_000  # underscores are only digit separators; same as 1000000

Canada      35467000.0
France      63951000.0
Germany     80940000.0
Italy       60665000.0
Japan      127061000.0
UK          64511000.0
US         318523000.0
Name: G7 Population in millions, dtype: float64

In [68]:
np.log(g7_pop), g7_pop[['Germany', 'US']].mean(), g7_pop['Germany':'UK'].mean()

(Canada     3.568603
 France     4.158117
 Germany    4.393708
 Italy      4.105367
 Japan      4.844667
 UK         4.166836
 US         5.763695
 Name: G7 Population in millions, dtype: float64,
 199.7315,
 83.29425)
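
A few more vectorized operations and aggregations, as a sketch on the same figures. The names pop, people, share and largest are ours, and the percentage share is only an illustrative calculation, not part of the notebook.

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan', 'UK', 'US'],
)

people = pop * 1_000_000          # element-wise multiplication
share = pop / pop.sum() * 100     # each country's share of the total, in percent
largest = pop.idxmax()            # index label of the maximum value

print(share.round(1))
print(largest, pop.max(), pop.min())
```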

### Modifying series

In [71]:
g7_pop['Canada'] = 40.5   # assign by label
g7_pop.iloc[-1] = 500     # assign by position (the last element, US)
g7_pop

Canada      40.500
France      63.951
Germany     80.940
Italy       60.665
Japan      127.061
UK          64.511
US         500.000
Name: G7 Population in millions, dtype: float64

In [73]:
g7_pop[g7_pop < 70] = 99.99  # assign to every element that matches the condition
g7_pop

Canada      99.990
France      99.990
Germany     80.940
Italy       99.990
Japan      127.061
UK          99.990
US         500.000
Name: G7 Population in millions, dtype: float64
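
These assignments change the Series in place, so the original figures are lost. One possible way to keep them, sketched here with illustrative names pop and updated, is to work on a copy.

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.940],
    index=['Canada', 'France', 'Germany'],
)

updated = pop.copy()             # independent copy; pop stays untouched
updated.loc['Canada'] = 40.5     # assign by label
updated.iloc[-1] = 500           # assign by position (last element)
updated[updated < 70] = 99.99    # assign through a boolean mask

print(pop)
print(updated)
```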