##### Pandas

1. Series
2. DataFrame

In [None]:
import pandas as pd
import numpy as np

##### Pandas Series

We'll start by analyzing "The Group of Seven" (G7), a political forum formed by Canada, France, Germany, Italy, Japan, the United Kingdom and the United States. We'll begin with population, and for that we'll use a pandas.Series object.


In [None]:
# Population in millions, as a pandas Series and as a NumPy array for comparison
population = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])
pop = np.array([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

In [None]:
pop

array([ 35.467,  63.951,  80.94 ,  60.665, 127.061,  64.511, 318.523])

In [None]:
population

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

In [None]:
# Series can have a name, to better document the purpose of the Series:
population.name = 'G7 Population in millions'

In [None]:
population

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

In [None]:
population.dtype

dtype('float64')

In [None]:
population.values

array([ 35.467,  63.951,  80.94 ,  60.665, 127.061,  64.511, 318.523])

In [None]:
# The values of a Series are backed by a NumPy array:
type(population.values)

numpy.ndarray

A pandas Series looks like a simple Python list or NumPy array, but it is actually more similar to a Python dict.

A Series has an index, which by default resembles the automatic positional index of Python lists:


In [None]:
population

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

In [None]:
population[0]

35.467

In [None]:
population[1]

63.951

In [None]:
population.index

RangeIndex(start=0, stop=7, step=1)

But, in contrast to lists, we can explicitly define the index:

In [None]:
population.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

In [None]:
population

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64

We can say that Series look like "ordered dictionaries". We can actually create Series out of dictionaries:

In [None]:
pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}, name='G7 Population in millions')

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
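
The dictionary analogy extends to a few everyday operations. The following is a minimal sketch, using a shortened three-country version of the data above; it assumes only that `in`, `keys()`, `items()` and `get()` behave on a Series as they do on a dict, with index labels playing the role of keys.

```python
import pandas as pd

# Shortened version of the G7 data, for illustration only.
population = pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.940,
}, name='G7 Population in millions')

# Membership tests look at the index labels, as `in` looks at dict keys.
has_france = 'France' in population
has_spain = 'Spain' in population

# keys() returns the index; items() yields (label, value) pairs.
labels = list(population.keys())
pairs = list(population.items())

# As with dicts, get() returns a default instead of raising KeyError.
spain = population.get('Spain', default=None)

print(has_france, has_spain)
print(labels)
print(pairs[0])
print(spain)
```

Unlike a dict, the Series also keeps positional access through `iloc`, which is covered in the Indexing section.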

In [None]:
pd.Series(
    [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom',
       'United States'],
    name='G7 Population in millions')

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [None]:
population

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64

In [None]:
pd.Series(population, index=['Germany', 'Italy', 'Spain','France'])

Germany    80.940
Italy      60.665
Spain         NaN
France     63.951
dtype: float64
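
Passing an existing Series together with `index=` keeps the labels that match and fills the ones that don't (here `'Spain'`) with `NaN`. As a hedged sketch, the same result can be requested explicitly with `reindex`; the shortened data and the `dropna()` follow-up are illustrative additions, not part of the original notebook.

```python
import pandas as pd

# Shortened version of the G7 data, for illustration only.
population = pd.Series(
    [63.951, 80.940, 60.665],
    index=['France', 'Germany', 'Italy'],
)

labels = ['Germany', 'Italy', 'Spain', 'France']

# Labels missing from the original index ('Spain') become NaN.
subset = population.reindex(labels)

# dropna() discards the labels that had no match.
matched = subset.dropna()

print(subset)
print(matched)
```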

##### Indexing

Indexing works much as it does with lists and dictionaries: you use the index label of the element you're looking for.

In [None]:
population

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64

In [None]:
pd.Series(population, index=['Canada'])

Canada    35.467
dtype: float64

In [None]:
population['Canada']

35.467

In [None]:
population['Japan']

127.061

Numeric positions can also be used, with the iloc attribute:

In [None]:
population.iloc[0:4]

Canada     35.467
France     63.951
Germany    80.940
Italy      60.665
dtype: float64

In [None]:
population.iloc[-1]

318.523

Selecting multiple elements at once:

In [None]:
population[['Italy', 'France']]

Italy     60.665
France    63.951
Name: G7 Population in millions, dtype: float64

Slicing also works, but note an important difference from Python lists: when slicing by label, pandas includes the upper limit.

In [None]:
population['Canada': 'Italy']

Canada     35.467
France     63.951
Germany    80.940
Italy      60.665
Name: G7 Population in millions, dtype: float64
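
To make the difference concrete, here is a small sketch comparing a label-based slice with a position-based one. It uses the explicit `loc` and `iloc` accessors and a shortened five-country version of the data; both choices are for illustration only.

```python
import pandas as pd

# Shortened version of the G7 data, for illustration only.
population = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan'],
)

# Label-based slice: 'Italy' (the upper limit) is included.
by_label = population.loc['Canada':'Italy']

# Position-based slice: position 3 ('Italy') is excluded, as with lists.
by_position = population.iloc[0:3]

print(len(by_label))     # 4 elements
print(len(by_position))  # 3 elements
```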

##### Conditional selection (boolean arrays)

In [None]:
population

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [None]:
population > 70

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: G7 Population in millions, dtype: bool

In [None]:
population.mean()

107.30257142857144

In [None]:
population.std()

97.24996987121581
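
A boolean Series such as `population > 70` can be passed back into the square brackets to keep only the elements where it is `True`. The sketch below rebuilds the same data so that it runs on its own; the thresholds 80 and 40 in the combined condition are arbitrary values chosen for illustration.

```python
import pandas as pd

population = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
    name='G7 Population in millions',
)

# Keep only the elements where the mask is True.
over_70 = population[population > 70]

# The threshold can itself be computed from the Series.
above_mean = population[population > population.mean()]

# Combine conditions with | (or) and & (and); each one needs parentheses.
extremes = population[(population > 80) | (population < 40)]

print(over_70)
print(above_mean)
print(extremes)
```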

##### Operations and methods

Series also support vectorized operations and aggregation functions, just like NumPy arrays:

In [None]:
population * 1_000_000

Canada             35467000.0
France             63951000.0
Germany            80940000.0
Italy              60665000.0
Japan             127061000.0
United Kingdom     64511000.0
United States     318523000.0
Name: G7 Population in millions, dtype: float64

In [None]:
np.log(population)

Canada            3.568603
France            4.158117
Germany           4.393708
Italy             4.105367
Japan             4.844667
United Kingdom    4.166836
United States     5.763695
Name: G7 Population in millions, dtype: float64
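
Aggregations and vectorized arithmetic can be combined. As an illustrative sketch (the variable names `total`, `largest` and `share` are not from the original notebook), the following computes each country's share of the total G7 population:

```python
import pandas as pd

population = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
    name='G7 Population in millions',
)

# Aggregations reduce the Series to a single value.
total = population.sum()
largest = population.idxmax()   # label of the maximum value

# Vectorized arithmetic is applied element-wise, with no explicit loop.
share = population / total * 100

print(largest)
print(share.round(1))
```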

In [None]:
population

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64

In [None]:
population['France': 'Italy'].mean()

68.51866666666666

In [None]:
population['Canada': 'Germany'].sum()

180.358

##### Modifying series

In [None]:
population['Canada'] = 40.5
population

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
dtype: float64

In [None]:
population.iloc[-1] = 500

In [None]:
population

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     500.000
dtype: float64
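
Assignment also works with boolean masks, which updates every matching element at once. This is a minimal sketch on a shortened copy of the data; the threshold 70 and the replacement value 99.99 are arbitrary placeholders.

```python
import pandas as pd

# Shortened version of the modified data, for illustration only.
population = pd.Series(
    [40.5, 63.951, 80.940, 60.665],
    index=['Canada', 'France', 'Germany', 'Italy'],
)

# Work on a copy so the original Series is left untouched.
adjusted = population.copy()

# Assign one value to every element matching the condition.
adjusted.loc[adjusted < 70] = 99.99   # 99.99 is an arbitrary placeholder

print(adjusted)
```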