# Pandas Series

In [79]:
import numpy as np
import pandas as pd

`Series` is a 1D labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

In [80]:
s = pd.Series(['data'], index=['index'])

`data` can be different things:
- a Python dict
- an ndarray
- a scalar value

The passed `index` is a list of axis labels.

### From `ndarray`
If `data` is an `ndarray`, `index` must be the same length as the data. If no index is passed, one will be created having values `[0, ..., len(data) - 1]`.

In [81]:
# specify the index
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -0.609646
b    1.252616
c    1.172667
d   -0.433587
e   -0.784907
dtype: float64

In [82]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [83]:
# Pandas can also create a default index
pd.Series(np.random.rand(5))

0    0.913766
1    0.790213
2    0.054953
3    0.804601
4    0.813180
dtype: float64
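The length rule can be checked directly. The following is a minimal sketch with made-up values (`values`, `ok`, and `default` are illustrative names, not part of the notebook); it assumes a reasonably recent pandas, where a length mismatch raises `ValueError`.

```python
import numpy as np
import pandas as pd

values = np.arange(3)  # three values

# An index of matching length is accepted.
ok = pd.Series(values, index=['a', 'b', 'c'])

# A two-label index for three values is rejected.
try:
    pd.Series(values, index=['a', 'b'])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print(f'ValueError: {err}')

# Without an index, pandas falls back to 0 .. len(data) - 1.
default = pd.Series(values)
print(list(default.index))
```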

### From `dict`
Series can be created from `dicts`.

In [84]:
d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

When `data` is a `dict` and no `index` is passed, the `Series` index follows the `dict`'s insertion order, provided you have `Python` version >= 3.6 and `Pandas` version >= 0.23. On older versions, the index is the sorted list of `dict` keys instead.
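An `index` can also be passed together with a `dict`. In that case, as far as I know, the values are looked up by label, and labels missing from the `dict` become `NaN`. A small sketch (the name `picked` is only for illustration):

```python
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

# Values are pulled out by label; 'd' is not a key of the dict,
# so its entry is missing (NaN) and the dtype becomes float64.
picked = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(picked)
```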

### From `scalar` value
If data is a `scalar` value, an `index` must be provided. The value will be repeated to match the length of the `index`.

In [85]:
pd.Series(5, index=['a', 'b', 'c', 'd', 'e'])

a    5
b    5
c    5
d    5
e    5
dtype: int64

## Series is `ndarray`-like
`Series` acts very similarly to `ndarray` from `numpy` and is a valid argument to most `numpy` functions. Operations such as slicing will also slice the index.

In [86]:
s.iloc[0]  # by position; plain s[0] on a labeled index is deprecated in recent pandas

-0.6096458277495969

In [87]:
s[:3]

a   -0.609646
b    1.252616
c    1.172667
dtype: float64

In [88]:
s[s > s.median()]

b    1.252616
c    1.172667
dtype: float64

In [89]:
s.iloc[[4, 3, 1]]  # by position; integer keys on a labeled index are deprecated in recent pandas

e   -0.784907
d   -0.433587
b    1.252616
dtype: float64

In [90]:
np.exp(s)

a    0.543543
b    3.499487
c    3.230598
d    0.648180
e    0.456162
dtype: float64

In [91]:
s.dtype

dtype('float64')

In [92]:
s.to_numpy()

array([-0.60964583,  1.25261648,  1.17266714, -0.43358718, -0.78490685])

## Series is `dict`-like
A `Series` is like a fixed-size `dict` in which you can get and set values by an `index` label.

In [93]:
s['a']

-0.6096458277495969

In [94]:
s['e'] = 12
s

a    -0.609646
b     1.252616
c     1.172667
d    -0.433587
e    12.000000
dtype: float64

In [95]:
'e' in s

True

In [96]:
'f' in s

False

In [97]:
try:
    s['f']
except KeyError as err:
    # str(err) is the quoted missing label
    print(f'KeyError: {err}')

KeyError: 'f'
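To avoid the exception, the `dict`-style `get` method can be used instead. This is a small sketch on a stand-in `Series` (the values are made up); `get` returns `None`, or a supplied default, when the label is missing.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 12.0], index=['a', 'e'])

present = s.get('a')           # label exists: returns the value
missing = s.get('f')           # label missing: returns None
fallback = s.get('f', np.nan)  # label missing: returns the given default
print(present, missing, fallback)
```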


## Vectorized operations
When working with raw `numpy` arrays, looping through value-by-value is usually not necessary. The same is true when working with `Series` in `Pandas`. `Series` can also be passed into most `numpy` methods expecting an `ndarray`.

In [98]:
s + s

a    -1.219292
b     2.505233
c     2.345334
d    -0.867174
e    24.000000
dtype: float64

In [99]:
s * 2

a    -1.219292
b     2.505233
c     2.345334
d    -0.867174
e    24.000000
dtype: float64

In [100]:
np.exp(s)

a         0.543543
b         3.499487
c         3.230598
d         0.648180
e    162754.791419
dtype: float64

A key difference between `Series` and `ndarray` is that operations between `Series` automatically align data based on the label. Thus, you can write computations without considering whether the `Series` involved have the same labels.

In [101]:
s1 = s[1:]
s2 = s[:-1]
s1 + s2

a         NaN
b    2.505233
c    2.345334
d   -0.867174
e         NaN
dtype: float64

The result of an operation between unaligned `Series` will have the union of the indexes involved. If a label is not found in one `Series` or the other, the result will be marked as missing (`NaN`).
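If the `NaN` entries are not wanted, two common options are dropping them afterwards with `dropna()`, or using the method form `add()` with a `fill_value`. The sketch below uses made-up values so that the results are easy to follow; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])
s1 = s[1:]   # labels b, c, d
s2 = s[:-1]  # labels a, b, c

# The operator form leaves NaN where a label exists on only one side.
plain = s1 + s2

# Drop the unmatched labels after the fact ...
dropped = plain.dropna()

# ... or treat a label missing on one side as 0 during the addition.
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```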

## Name attribute
`Series` can also have a `name` attribute.

In [102]:
s = pd.Series(np.random.randn(5), name='something')
s

0    2.485159
1    0.040272
2   -1.690702
3    1.521265
4    0.119336
Name: something, dtype: float64

In [103]:
s.name

'something'
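A `Series` can be given a different name with `rename()`. The sketch below assumes the scalar form of `rename()`, which returns a new `Series` and leaves the original untouched.

```python
import pandas as pd

s = pd.Series([1, 2, 3], name='something')

# rename() with a scalar changes the name, not the index labels.
s2 = s.rename('different')
print(s.name, s2.name)
```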

# Pandas DataFrame

`DataFrame` is a 2D labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a `dictionary` of `Series` objects. It is generally the most commonly used Pandas object.

Like `Series`, `DataFrame` accepts many different kinds of inputs:
- `dictionary` of 1D `ndarrays`, `lists`, `dictionaries`, or `Series`
- 2D `ndarray`
- `Series`
- `DataFrame`

## From `dict` of `Series` or `dicts`
The resulting index will be the union of the indices of the various `Series`.

In [104]:
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df

|  | one | two |
| :- | :- | :- |
| a | 1.0 | 1 |
| b | 2.0 | 2 |
| c | 3.0 | 3 |
| d | NaN | 4 |


In [105]:
pd.DataFrame(d, index=['d', 'b', 'a'])

|  | one | two |
| :- | :- | :- |
| d | NaN | 4 |
| b | 2.0 | 2 |
| a | 1.0 | 1 |


In [106]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

|  | two | three |
| :- | :- | :- |
| d | 4 | NaN |
| b | 2 | NaN |
| a | 1 | NaN |


When `data` is a `dictionary` and `columns` is not passed, the `DataFrame` columns follow the `dictionary`'s insertion order, provided you have Python version >= 3.6 and Pandas version >= 0.23.

If an `index` and/or `columns` are passed, they are guaranteed to appear in the resulting `DataFrame`. If the `dict` of `Series` does not contain a specified `index` label or `column`, the corresponding row or column will be filled with missing values (`NaN`).
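The row and column labels of the result can be inspected through the `index` and `columns` attributes. A short sketch that rebuilds the same `dict` of `Series` and confirms that the requested labels are present even when no data exists for them:

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

print(df.index.tolist())    # the requested row labels, in the requested order
print(df.columns.tolist())  # the requested columns, including the unknown 'three'

# 'three' is not a key of d, so the whole column is missing.
all_missing = df['three'].isna().all()
```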

## From `dict` of `ndarrays` or `lists`
The `ndarrays` must all be of the same length. If an `index` is passed, it must also be of the same length as the arrays. If no `index` is passed, the result will be `range(n)`, where `n` is the array length.

In [107]:
d = {'one': [1, 2, 3, 4],
     'two': [4, 3, 2, 1]}

pd.DataFrame(d)

|  | one | two |
| :- | :- | :- |
| 0 | 1 | 4 |
| 1 | 2 | 3 |
| 2 | 3 | 2 |
| 3 | 4 | 1 |


In [108]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

|  | one | two |
| :- | :- | :- |
| a | 1 | 4 |
| b | 2 | 3 |
| c | 3 | 2 |
| d | 4 | 1 |
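Both length rules can be checked with a quick experiment. This is a sketch with made-up data; it assumes that a mismatch surfaces as a `ValueError`, which is the behaviour I know from recent pandas versions.

```python
import pandas as pd

# Lists of different lengths cannot form the columns of one DataFrame.
try:
    pd.DataFrame({'one': [1, 2, 3], 'two': [1, 2]})
    unequal_raised = False
except ValueError as err:
    unequal_raised = True
    print(f'ValueError: {err}')

# An index shorter than the arrays is rejected as well.
try:
    pd.DataFrame({'one': [1, 2, 3]}, index=['a', 'b'])
    index_raised = False
except ValueError as err:
    index_raised = True
    print(f'ValueError: {err}')
```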


## From a `Series`
The result will be a `DataFrame` with the same `index` as the input `Series`, and with one column whose name is the original name of the `Series` (only if no other `column` name is provided).

In [109]:
pd.DataFrame(pd.Series(np.random.rand(5), name='something'))

|  | something |
| :- | :- |
| 0 | 0.03104 |
| 1 | 0.565898 |
| 2 | 0.77425 |
| 3 | 0.674369 |
| 4 | 0.174042 |
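One way to provide a different column name is `Series.to_frame(name=...)`. The sketch below uses fixed values instead of random ones so that the result is reproducible; `kept` and `renamed` are illustrative names.

```python
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3], name='something')

kept = pd.DataFrame(s)                # the column takes the Series name
renamed = s.to_frame(name='other')    # an explicit column name wins

print(kept.columns.tolist(), renamed.columns.tolist())
```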


## Column selection, addition, deletion
`DataFrame` can be treated like a `dict` of similarly indexed `Series` objects. Getting, setting, and deleting columns works with the same syntax as the analogous `dict` operations.

In [110]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [111]:
df['three'] = df['one'] * df['two']
df['flag'] = df['one'] > 2
df

|  | one | two | three | flag |
| :- | :- | :- | :- | :- |
| a | 1.0 | 1 | 1.0 | False |
| b | 2.0 | 2 | 4.0 | False |
| c | 3.0 | 3 | 9.0 | True |
| d | NaN | 4 | NaN | False |


Columns can be deleted as if the `DataFrame` were a `dict`.

In [112]:
del df['two']
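Like a `dict`, a `DataFrame` also offers `pop()`, which removes a column and returns it as a `Series`. A minimal sketch on a separate, made-up frame (so the `df` used in this notebook is not affected):

```python
import pandas as pd

demo = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6], 'three': [7, 8, 9]})

del demo['two']              # remove a column and discard it
three = demo.pop('three')    # remove a column and keep it as a Series

print(demo.columns.tolist())
print(three.tolist())
```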

When a `scalar` value is assigned to a column, it is broadcast to fill every row. The column is created if it does not already exist.

In [113]:
df['foo'] = 'bar'
df

|  | one | three | flag | foo |
| :- | :- | :- | :- | :- |
| a | 1.0 | 1.0 | False | bar |
| b | 2.0 | 4.0 | False | bar |
| c | 3.0 | 9.0 | True | bar |
| d | NaN | NaN | False | bar |


When inserting a `Series` that does not have the same index as the `DataFrame`, it will be conformed to fit the `DataFrame` index.

In [114]:
df['one_trunc'] = df['one'][:2]
df

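|  | one | three | flag | foo | one_trunc |
| :- | :- | :- | :- | :- | :- |
| a | 1.0 | 1.0 | False | bar | 1.0 |
| b | 2.0 | 4.0 | False | bar | 2.0 |
| c | 3.0 | 9.0 | True | bar | NaN |
| d | NaN | NaN | False | bar | NaN |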
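New columns are appended at the end by default. To place a column at a specific position, `DataFrame.insert()` can be used. The sketch below works on a small made-up frame and also repeats the index alignment shown above with a one-element `Series`.

```python
import pandas as pd

demo = pd.DataFrame({'one': [1.0, 2.0, 3.0],
                     'flag': [False, False, True]},
                    index=['a', 'b', 'c'])

# Insert a new column at position 1, between 'one' and 'flag'.
demo.insert(1, 'bar', demo['one'] * 10)

# A Series covering only label 'b' is conformed to the frame's index:
# rows 'a' and 'c' receive NaN.
demo['partial'] = pd.Series([100.0], index=['b'])

print(demo)
```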


In [115]:
df = pd.DataFrame(np.random.randn(8, 3), index=range(8), columns=list('ABC'))
df * 5 + 2

|  | A | B | C |
| :- | :- | :- | :- |
| 0 | 3.980163 | -10.040917 | -3.222317 |
| 1 | -4.944181 | -4.044841 | 4.376708 |
| 2 | 10.394835 | 14.772822 | 1.492035 |
| 3 | -1.035881 | -2.991861 | -1.215975 |
| 4 | 6.495362 | 9.0836 | 8.76308 |
| 5 | 2.490878 | -2.465444 | 5.190736 |
| 6 | 1.948799 | -5.069801 | 5.061417 |
| 7 | 12.23738 | -1.480008 | 1.284076 |


In [116]:
1 / df

|  | A | B | C |
| :- | :- | :- | :- |
| 0 | 2.525045 | -0.415251 | -0.957429 |
| 1 | -0.720027 | -0.827152 | 2.10375 |
| 2 | 0.595604 | 0.391456 | -9.843193 |
| 3 | -1.646968 | -1.00163 | -1.554738 |
| 4 | 1.112258 | 0.705856 | 0.739308 |
| 5 | 10.185836 | -1.11971 | 1.567037 |
| 6 | -97.654555 | -0.707234 | 1.633231 |
| 7 | 0.488406 | -1.436778 | -6.983986 |


In [117]:
df ** 4

|  | A | B | C |
| :- | :- | :- | :- |
| 0 | 0.02459934 | 33.632426 | 1.190071 |
| 1 | 3.720524 | 2.136286 | 0.051053 |
| 2 | 7.946369 | 42.586055 | 0.000107 |
| 3 | 0.1359124 | 0.993505 | 0.171148 |
| 4 | 0.6533992 | 4.028432 | 3.347327 |
| 5 | 9.289949e-05 | 0.636178 | 0.165838 |
| 6 | 1.099588e-08 | 3.997133 | 0.140543 |
| 7 | 17.57419 | 0.234661 | 0.00042 |


In [118]:
df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

In [119]:
df1

|  | a | b |
| :- | :- | :- |
| 0 | True | False |
| 1 | False | True |
| 2 | True | True |


In [120]:
df2

|  | a | b |
| :- | :- | :- |
| 0 | False | True |
| 1 | True | True |
| 2 | True | False |


In [121]:
df1 & df2  # AND

|  | a | b |
| :- | :- | :- |
| 0 | False | False |
| 1 | False | True |
| 2 | True | False |


In [122]:
df1 | df2  # OR

|  | a | b |
| :- | :- | :- |
| 0 | True | True |
| 1 | True | True |
| 2 | True | True |


In [123]:
df1 ^ df2  # XOR

|  | a | b |
| :- | :- | :- |
| 0 | True | True |
| 1 | True | False |
| 2 | False | True |


In [124]:
~df1  # NOT (the idiomatic operator for logical negation is ~, not unary minus)

|  | a | b |
| :- | :- | :- |
| 0 | False | True |
| 1 | True | False |
| 2 | False | False |


# Pandas `dtypes`
For the most part, `pandas` uses `numpy` arrays and dtypes for `Series` or individual columns of a `DataFrame`. `numpy` provides support for `float`, `int`, `bool`, `timedelta64[ns]`, and `datetime64[ns]`.

However, `numpy` has no native support for some kinds of data, such as categoricals, integers with missing values, or a dedicated string type, so `pandas` has to extend `numpy`'s type system in a few places.

| Data Type | `pandas` dtype | String Alias |
| :- | :- | :- |
| Categorical | `CategoricalDtype` | `'category'` |
| Nullable Integers | `Int64Dtype` (and the other `Int*`/`UInt*` dtypes) | `'Int64'` (note the capital `I`) |
| Strings | `StringDtype` | `'string'` |
| Nullable Boolean | `BooleanDtype` | `'boolean'` |
| Others | `object` (a `numpy` dtype) | `'object'` |
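The extension types can be requested by passing a string alias as `dtype`. The sketch below assumes the aliases `'category'`, `'Int64'`, `'string'`, and `'boolean'` (the lowercase `'int'` and `'bool'` select the plain `numpy` types instead); the data is made up.

```python
import pandas as pd

cat = pd.Series(['a', 'b', 'a'], dtype='category')
ints = pd.Series([1, None, 3], dtype='Int64')           # integers with a missing value
text = pd.Series(['x', None], dtype='string')
flags = pd.Series([True, None, False], dtype='boolean')

for ser in (cat, ints, text, flags):
    print(ser.dtype)
```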

In [125]:
dft = pd.DataFrame({'A': np.random.rand(3),
                    'B': 1,
                    'C': 'foo',
                    'D': pd.Timestamp('20010102'),
                    'E': pd.Series([1.0] * 3).astype('float32'),
                    'F': False,
                    'G': pd.Series([1] * 3, dtype='int8')})

dft

Unnamed: 0,A,B,C,D,E,F,G
0,0.641606,1,foo,2001-01-02,1.0,False,1
1,0.680702,1,foo,2001-01-02,1.0,False,1
2,0.36673,1,foo,2001-01-02,1.0,False,1


In [126]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

In [127]:
dft['A'].dtype

dtype('float64')

`pandas` has two ways of storing strings.
1. `object dtype`, which can hold any Python object, including strings.
2. `StringDtype`, a dedicated string type available in `pandas` version >= 1.0.0 (released in 2020).
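The two storage options can be compared side by side. In this sketch both dtypes are requested explicitly, because the default dtype chosen for string data may differ between pandas versions; the values are made up.

```python
import pandas as pd

# object dtype: any Python object is allowed, not only strings.
mixed = pd.Series(['a', 1, None], dtype=object)

# Dedicated string dtype; string methods keep the dtype.
text = pd.Series(['a', 'b'], dtype='string')
upper = text.str.upper()

print(mixed.dtype, text.dtype, upper.dtype)
```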

## Conversion

The `astype()` method explicitly converts `dtypes`. By default it returns a copy, even if the `dtype` is unchanged, unless the parameter `copy` is set to `False`.

In [128]:
df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')
df1.dtypes

A    float32
dtype: object

In [129]:
df1 = df1.astype('float64')
df1.dtypes

A    float64
dtype: object

In [130]:
dft1 = pd.DataFrame({'a': [1, 0, 1], 'b': [4, 5, 6], 'c': [7, 8, 9]})
dft1.dtypes

a    int64
b    int64
c    int64
dtype: object

In [133]:
dft1 = dft1.astype({'a': 'bool', 'c': np.float64})
dft1.dtypes

a       bool
b      int64
c    float64
dtype: object
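Two further properties of the `astype()` conversions above are worth checking: the original object is left untouched, and a conversion that cannot succeed raises an error instead of silently producing bad data. A minimal sketch with made-up data, assuming the default `errors='raise'` behaviour:

```python
import pandas as pd

orig = pd.DataFrame({'a': [1, 0, 1]})
converted = orig.astype('float64')

# astype() returns a new object; the original keeps its dtype.
print(orig.dtypes['a'], converted.dtypes['a'])

# 'x' cannot be parsed as an integer, so the conversion fails.
bad = pd.Series(['1', '2', 'x'], dtype=object)
try:
    bad.astype('int64')
    conversion_raised = False
except ValueError as err:
    conversion_raised = True
    print(f'ValueError: {err}')
```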