# Pandas Series

In [1]:
import numpy as np
import pandas as pd

`Series` is a 1D labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

In [3]:
s = pd.Series(['data'], index=['index'])

`data` can be different things:
- a Python dict
- an ndarray
- a scalar value

The passed `index` is a list of axis labels.

### From `ndarray`
If `data` is an `ndarray`, the index must be the same length as the data. If no index is passed, one will be created having values `[0, ..., len(data) - 1]`.

In [4]:
# specify the index
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -2.035138
b   -0.057130
c   -0.124906
d    1.067390
e   -1.470917
dtype: float64

In [5]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [7]:
# Pandas can also create a default index
pd.Series(np.random.rand(5))

0    0.505430
1    0.649689
2    0.141526
3    0.328998
4    0.425865
dtype: float64
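The length requirement can be checked directly. This is a minimal sketch, assuming a recent pandas version in which a mismatched index raises `ValueError`; the names `values`, `ok`, and `raised` are illustrative and not part of the notebook above.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, 2.0, 3.0])

# Matching lengths: three values, three labels
ok = pd.Series(values, index=['a', 'b', 'c'])

# Mismatched lengths: three values, two labels
try:
    pd.Series(values, index=['a', 'b'])
    raised = False
except ValueError:
    raised = True

print(raised)
```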

### From `dict`
Series can be created from `dicts`.

In [8]:
d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

When data is a `dict` and an `index` is not passed, the `Series` index will be ordered by the `dict`'s insertion order. There is no sorting if you have `Python` version >= 3.6 and `Pandas` version >= 0.23.
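When an `index` is passed together with a `dict`, the values are looked up by label instead. The sketch below is illustrative (the label `'z'` and the name `s_from_dict` are made up for this example): labels missing from the `dict` come back as `NaN`, and the result follows the order of the passed `index`.

```python
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

# 'z' has no entry in the dict, so its value is NaN;
# 'b' is not requested, so it is left out
s_from_dict = pd.Series(d, index=['c', 'z', 'a'])
print(s_from_dict)
```

Because `NaN` is a floating point value, the resulting `dtype` is `float64` rather than `int64`.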

### From `scalar` value
If data is a `scalar` value, an `index` must be provided. The value will be repeated to match the length of the `index`.

In [9]:
pd.Series(5, index=['a', 'b', 'c', 'd', 'e'])

a    5
b    5
c    5
d    5
e    5
dtype: int64

## Series is `ndarray`-like
`Series` acts very similarly to `ndarray` from `numpy` and is a valid argument to most `numpy` functions. Operations such as slicing will also slice the index.

In [10]:
# select by position; plain s[0] on a labeled index is deprecated in recent pandas
s.iloc[0]

-2.035137579647807

In [11]:
s[:3]

a   -2.035138
b   -0.057130
c   -0.124906
dtype: float64

In [12]:
s[s > s.median()]

b   -0.05713
d    1.06739
dtype: float64

In [13]:
# select several positions; the index labels come along
s.iloc[[4, 3, 1]]

e   -1.470917
d    1.067390
b   -0.057130
dtype: float64

In [14]:
np.exp(s)

a    0.130663
b    0.944471
c    0.882580
d    2.907780
e    0.229715
dtype: float64

In [15]:
s.dtype

dtype('float64')

In [16]:
s.to_numpy()

array([-2.03513758, -0.05713035, -0.12490551,  1.06738982, -1.47091712])
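Because a `Series` carries labels as well as positions, it can help to state which one is meant. The sketch below uses the `.iloc` (position) and `.loc` (label) indexers on a small illustrative `Series` named `t`, which is not part of the notebook above.

```python
import pandas as pd

t = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

first_by_position = t.iloc[0]   # element at position 0
first_by_label = t.loc['a']     # element labeled 'a'
subset = t.iloc[[2, 0]]         # positions 2 and 0; labels 'c' and 'a' follow

print(first_by_position, first_by_label)
print(subset)
```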

## Series is `dict`-like
A `Series` is like a fixed-size `dict` in which you can get and set values by an `index` label.

In [17]:
s['a']

-2.035137579647807

In [20]:
s['e'] = 12
s

a    -2.035138
b    -0.057130
c    -0.124906
d     1.067390
e    12.000000
dtype: float64

In [21]:
'e' in s

True

In [22]:
'f' in s

False

In [28]:
try:
    s['f']
except KeyError:
    print("KeyError: 'f'")

KeyError: 'f'
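As with a `dict`, a missing label can also be handled without an exception by using `Series.get`, which returns `None` or a supplied default. A small sketch on an illustrative `Series` named `t`:

```python
import numpy as np
import pandas as pd

t = pd.Series([1.5, 2.5], index=['a', 'b'])

present = t.get('a')            # label exists, value is returned
missing = t.get('f')            # label absent, default is None
fallback = t.get('f', np.nan)   # label absent, explicit default

print(present, missing, fallback)
```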


## Vectorized operations
When working with raw `numpy` arrays, looping through value-by-value is usually not necessary. The same is true when working with `Series` in `Pandas`. `Series` can also be passed into most `numpy` methods expecting an `ndarray`.

In [29]:
s + s

a    -4.070275
b    -0.114261
c    -0.249811
d     2.134780
e    24.000000
dtype: float64

In [31]:
s * 2

a    -4.070275
b    -0.114261
c    -0.249811
d     2.134780
e    24.000000
dtype: float64

In [32]:
np.exp(s)

a         0.130663
b         0.944471
c         0.882580
d         2.907780
e    162754.791419
dtype: float64

A key difference between `Series` and `ndarray` is that operations between `Series` automatically align data based on the label. Thus, you can write computations without considering whether the `Series` involved have the same labels.

In [33]:
s1 = s[1:]
s2 = s[:-1]
s1 + s2

a         NaN
b   -0.114261
c   -0.249811
d    2.134780
e         NaN
dtype: float64

The result of an operation between unaligned `Series` will have the union of the indexes involved. If a label is not found in one `Series` or the other, the result will be marked as missing (`NaN`).

## Name attribute
`Series` can also have a `name` attribute.

In [34]:
s = pd.Series(np.random.randn(5), name='something')
s

0   -0.946825
1    1.216063
2    0.123337
3    0.878028
4    0.984506
Name: something, dtype: float64

In [35]:
s.name

'something'
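The name can be changed with `Series.rename`, which returns a new `Series` and leaves the original untouched. A short sketch using an illustrative `Series` named `t`:

```python
import pandas as pd

t = pd.Series([1, 2, 3], name='something')
renamed = t.rename('different')   # new object with a new name

print(t.name, renamed.name)
```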

# Pandas DataFrame

`DataFrame` is a 2D labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a `dictionary` of `Series` objects. It is generally the most commonly used Pandas object.

Like `Series`, `DataFrame` accepts many different kinds of inputs:
- `dictionary` of 1D `ndarrays`, `lists`, `dictionaries`, or `Series`
- 2D `ndarray`
- `Series`
- `DataFrame`
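As a quick illustration of the 2D `ndarray` input, the sketch below builds a `DataFrame` from a small array. The names `arr` and `df_arr` and the labels are chosen for this example only.

```python
import numpy as np
import pandas as pd

# 3 rows x 2 columns: [[0, 1], [2, 3], [4, 5]]
arr = np.arange(6).reshape(3, 2)

# Row and column labels are optional; they default to 0..n-1
df_arr = pd.DataFrame(arr, index=['a', 'b', 'c'], columns=['x', 'y'])
print(df_arr)
```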

## From `dict` of `Series` or `dicts`
The resulting index will be the union of the indices of the various `Series`.

In [36]:
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


In [37]:
pd.DataFrame(d, index=['d', 'b', 'a'])

   one  two
d  NaN    4
b  2.0    2
a  1.0    1


In [38]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

   two three
d    4   NaN
b    2   NaN
a    1   NaN


When the data is a `dictionary` and `columns` is not passed, the `DataFrame` columns will be ordered by the `dictionary`'s insertion order. There is no sorting if you have Python version >= 3.6 and Pandas version >= 0.23.

If an `index` and/or `columns` are passed, they are guaranteed to appear in the resulting `DataFrame`. If the `dict` of `Series` does not contain a specified index label or column, the corresponding row or column will be filled with missing values (`NaN`).

## From `dict` of `ndarrays` or `lists`
The `ndarrays` must all be of the same length. If an `index` is passed, it must also be of the same length as the arrays. If no `index` is passed, the result will be `range(n)`, where `n` is the array length.

In [40]:
d = {'one': [1, 2, 3, 4],
     'two': [4, 3, 2, 1]}

pd.DataFrame(d)

   one  two
0    1    4
1    2    3
2    3    2
3    4    1


In [41]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

   one  two
a    1    4
b    2    3
c    3    2
d    4    1
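The length rules for a `dict` of lists can be verified with a small check. This sketch assumes a recent pandas version, where both kinds of mismatch raise `ValueError`; the helper `raises_value_error` is hypothetical and exists only for this example.

```python
import pandas as pd


def raises_value_error(build):
    """Return True if calling build() raises ValueError (illustrative helper)."""
    try:
        build()
        return False
    except ValueError:
        return True


# Lists of different lengths
unequal_lists = raises_value_error(
    lambda: pd.DataFrame({'one': [1, 2, 3], 'two': [4, 3]}))

# Index shorter than the lists
short_index = raises_value_error(
    lambda: pd.DataFrame({'one': [1, 2, 3]}, index=['a', 'b']))

# Matching lengths work
matching = raises_value_error(
    lambda: pd.DataFrame({'one': [1, 2, 3]}, index=['a', 'b', 'c']))

print(unequal_lists, short_index, matching)
```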


## From a `Series`
The result will be a `DataFrame` with the same `index` as the input `Series`, and with one column whose name is the original name of the `Series` (only if no other `column` name is provided).

In [42]:
pd.DataFrame(pd.Series(np.random.rand(5), name='something'))

   something
0   0.324519
1   0.013499
2   0.559761
3   0.474042
4   0.420258
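One way to supply a different column name is `Series.to_frame`, which takes an optional `name`. A brief sketch with an illustrative `Series` named `t`:

```python
import pandas as pd

t = pd.Series([0.1, 0.2, 0.3], name='something')

default_cols = list(pd.DataFrame(t).columns)            # uses the Series name
renamed_cols = list(t.to_frame(name='other').columns)   # name overridden

print(default_cols, renamed_cols)
```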


## Column selection, addition, deletion
`DataFrame` can be treated like a `dict` of similarly indexed `Series` objects. Getting, setting, and deleting columns works with the same syntax as the analogous `dict` operations.

In [43]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [44]:
df['three'] = df['one'] * df['two']
df['flag'] = df['one'] > 2
df

   one  two  three   flag
a  1.0    1    1.0  False
b  2.0    2    4.0  False
c  3.0    3    9.0   True
d  NaN    4    NaN  False


Columns can be deleted as if the `DataFrame` were a `dict`.

In [45]:
del df['two']

Assigning a `scalar` value creates the column if it doesn't already exist and fills every row with that value.

In [46]:
df['foo'] = 'bar'
df

   one  three   flag  foo
a  1.0    1.0  False  bar
b  2.0    4.0  False  bar
c  3.0    9.0   True  bar
d  NaN    NaN  False  bar


When inserting a `Series` that does not have the same index as the `DataFrame`, it will be conformed to fit the `DataFrame` index.

In [49]:
df['one_trunc'] = df['one'][:2]
df

   one  three   flag  foo  one_trunc
a  1.0    1.0  False  bar        1.0
b  2.0    4.0  False  bar        2.0
c  3.0    9.0   True  bar        NaN
d  NaN    NaN  False  bar        NaN
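The `dict` analogy extends to `pop`, and `DataFrame.insert` can be used when a new column should land at a specific position rather than at the end. The sketch below uses a small illustrative frame named `frame`, separate from `df` above.

```python
import pandas as pd

frame = pd.DataFrame({'one': [1.0, 2.0], 'two': [3.0, 4.0]}, index=['a', 'b'])

# pop removes the column and returns it as a Series
popped = frame.pop('two')

# insert places a new column at position 0; the scalar fills every row
frame.insert(0, 'zero', 0)

print(frame)
print(popped)
```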


In [50]:
df = pd.DataFrame(np.random.randn(8, 3), index=range(8), columns=list('ABC'))
df * 5 + 2

           A          B          C
0   2.538780  -1.099450  -0.041817
1   2.166033   7.439312   5.148903
2  -5.046813  -4.955278   6.385110
3  -1.016415   3.700909   2.202543
4   0.927362  11.554891  -2.110711
5  -3.295359   6.954279  -3.439368
6  -7.453045   2.712912  11.546576
7   4.193352  -3.779026   6.586271


In [52]:
1 / df

           A          B          C
0   9.280226  -1.613189  -2.448799
1  30.114530   0.919234   1.587855
2  -0.709541  -0.718879   1.140222
3  -1.657597   2.939604  24.686175
4  -4.661403   0.523292  -1.216335
5  -0.944223   1.009229  -0.919224
6  -0.528930   7.013493   0.523748
7   2.279616  -0.865198   1.090210


In [53]:
df ** 4

           A          B          C
0   0.000135   0.147659   0.027809
1   0.000001   1.400541   0.157310
2   3.945398   3.744362   0.591619
3   0.132460   0.013392   0.000003
4   0.002118  13.335920   0.456865
5   1.258061   0.963922   1.400599
6  12.776352   0.000413  13.289559
7   0.037030   1.784591   0.707879


In [57]:
df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

In [59]:
df1

       a      b
0   True  False
1  False   True
2   True   True


In [60]:
df2

       a      b
0  False   True
1   True   True
2   True  False


In [58]:
df1 & df2

       a      b
0  False  False
1  False   True
2   True  False
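The other element-wise boolean operators work the same way. The sketch below rebuilds the two frames so it runs on its own; the result names `either`, `exactly_one`, and `negated` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

either = df1 | df2        # element-wise OR
exactly_one = df1 ^ df2   # element-wise XOR
negated = ~df1            # element-wise NOT

print(either)
print(exactly_one)
print(negated)
```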
