## Basic data structures in pandas

1) Series: a 1-D labeled array holding data of any type.

2) DataFrame: a 2-D data structure that holds data like a two-dimensional array or a table with rows and columns.

Fundamentally, data alignment is intrinsic: the link between labels and data will not be broken unless you break it explicitly.

In [2]:
import numpy as np
import pandas as pd

## Series
It is a 1-D labeled array capable of holding any data type. The axis labels are collectively referred to as the index.
The basic way to create a Series is to call:

s = pd.Series(data, index=index)

In [3]:
s = pd.Series(np.random.randn(5), index=["a","b","c","d","e"])
s

a   -0.724029
b   -0.100432
c    1.492483
d   -0.516404
e   -1.488591
dtype: float64

In [4]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [5]:
pd.Series(np.random.randn(5))

0   -1.252210
1    0.016021
2   -1.042012
3    0.007696
4    2.499647
dtype: float64

#### From dict

In [6]:
d = {"b":1, "a":0, "c":2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

In [9]:
pd.Series(d, index=["b","c","d","a"])
# NaN (not a number) is the standard missing data marker used in pandas.

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
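As a small sketch of the behaviour above (the variable names `ser` and `missing_labels` are illustrative, not part of the original notebook), the following shows that a label absent from the dict becomes NaN and that the values are upcast to float64 to make room for it:

```python
import pandas as pd

d = {"b": 1, "a": 0, "c": 2}

# Values are pulled out of the dict in the order given by `index`.
ser = pd.Series(d, index=["b", "c", "d", "a"])

# "d" is not a key of the dict, so its value is NaN.
missing_labels = ser[ser.isna()].index.tolist()

print(ser.dtype)        # integers were upcast so that NaN can be stored
print(missing_labels)
```

`Series.isna()` is used here only to locate the missing entries; the construction itself is the same call as in the cell above.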

#### From scalar value
If "data" is a scalar value, an index must be provided. The value will be repeated to match the length of the index.

In [10]:
pd.Series(5.0, index=["a","b","c","d","e"])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

#### Series is ndarray-like
Series acts very similarly to an ndarray and is a valid argument to most NumPy functions. However, operations such as slicing will also slice the index.

In [11]:
s.iloc[0]

-0.724029191094619

In [12]:
s.iloc[:3]

a   -0.724029
b   -0.100432
c    1.492483
dtype: float64

In [13]:
s[s > s.median()]

b   -0.100432
c    1.492483
dtype: float64

In [14]:
np.exp(s)

a    0.484795
b    0.904446
c    4.448126
d    0.596662
e    0.225690
dtype: float64
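To make the contrast with a plain ndarray concrete, here is a minimal sketch using made-up values (`values`, `ser` and the other names are assumptions for illustration). Slicing the array keeps only positions, while slicing the Series keeps the matching labels as well:

```python
import numpy as np
import pandas as pd

values = np.array([10.0, 20.0, 30.0, 40.0])
ser = pd.Series(values, index=["a", "b", "c", "d"])

arr_slice = values[1:3]      # ndarray: just the two values
ser_slice = ser.iloc[1:3]    # Series: the two values plus labels "b" and "c"

labels = ser_slice.index.tolist()

# A Series can be passed directly to a NumPy reduction.
total = np.sum(ser)

print(arr_slice)
print(ser_slice)
print(labels, total)
```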

"Series.array" returns the array that backs the Series, which is always an ExtensionArray. An ExtensionArray is a thin wrapper around one or more concrete arrays like a numpy.ndarray. pandas knows how to take an ExtensionArray and store it in a Series or a column of a DataFrame.

In [16]:
s.array

<PandasArray>
[  -0.724029191094619, -0.10043224338408258,   1.4924829609613528,
   -0.516403713900107,  -1.4885908228115756]
Length: 5, dtype: float64

Even if the "Series" is backed by an "ExtensionArray", "Series.to_numpy()" will return a NumPy ndarray.

In [18]:
s.to_numpy()

array([-0.72402919, -0.10043224,  1.49248296, -0.51640371, -1.48859082])
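The float64 Series above is backed by a thin wrapper around an ndarray, so the conversion is not very visible. A sketch with the nullable "Int64" dtype (chosen here as an assumption, since it is a clear case of a non-NumPy extension array) may show the difference more plainly:

```python
import numpy as np
import pandas as pd

# "Int64" (capital I) is pandas' nullable integer extension dtype.
ser = pd.Series([1, 2, None], dtype="Int64")

backing = ser.array  # an ExtensionArray (here an IntegerArray)

# to_numpy() always returns an ndarray; dtype and na_value control
# how the missing entry is represented in the result.
as_numpy = ser.to_numpy(dtype="float64", na_value=np.nan)

print(type(backing).__name__)
print(as_numpy)
```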

#### Series is dict-like
A "Series" is also like a fixed-size dict in that you can get and set values by index label:

In [19]:
s["a"]

-0.724029191094619

In [20]:
s["e"]

-1.4885908228115756

In [21]:
"e" in s

True

In [23]:
s["f"] # exception is raised.

KeyError: 'f'

Using the "Series.get()" method, a missing label will return None or the specified default:

In [24]:
s.get("f")

In [25]:
s.get("f", np.nan)

nan
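The three lookup styles can be compared side by side. This is a minimal sketch on a small hand-made Series (the names `ser`, `found`, `missing`, `fallback` and `error_label` are illustrative):

```python
import pandas as pd

ser = pd.Series([1.5, 2.5], index=["a", "b"])

found = ser.get("a")            # label exists: returns the value
missing = ser.get("z")          # label absent: returns None
fallback = ser.get("z", 0.0)    # label absent: returns the given default

# Bracket access on an absent label raises KeyError instead.
try:
    ser["z"]
    error_label = None
except KeyError as exc:
    error_label = exc.args[0]

print(found, missing, fallback, error_label)
```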

## Vectorized operations and label alignment with Series
When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true when working with "Series" in pandas.

In [26]:
s + s

a   -1.448058
b   -0.200864
c    2.984966
d   -1.032807
e   -2.977182
dtype: float64

In [27]:
s * 2

a   -1.448058
b   -0.200864
c    2.984966
d   -1.032807
e   -2.977182
dtype: float64

A key difference between "Series" and ndarray is that operations between "Series" automatically align the data based on label. Thus, you can write computations without giving consideration to whether the "Series" involved have the same labels.

In [28]:
s + s

a   -1.448058
b   -0.200864
c    2.984966
d   -1.032807
e   -2.977182
dtype: float64

In [30]:
s.iloc[1:] + s.iloc[:-1]
# The result of an operation between unaligned Series will have the union of the
# indexes involved. If a label is not found in one Series or the other, the result
# will be marked as missing (NaN).

a         NaN
b   -0.200864
c    2.984966
d   -1.032807
e         NaN
dtype: float64
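The same alignment rule can be seen with two Series whose labels only partly overlap. The sketch below uses invented data; `dropna()` and `Series.add(..., fill_value=...)` are shown as two possible ways to deal with the resulting NaN values, not as the only ones:

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Values are matched by label, not by position.
total = left + right

# Keep only labels present in both operands.
overlap_only = total.dropna()

# Treat a label missing from one side as 0.0 before adding.
filled = left.add(right, fill_value=0.0)

print(total)
print(overlap_only)
print(filled)
```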

#### Name attribute

In [31]:
s = pd.Series(np.random.randn(5), name="physics")
s

0   -1.698383
1   -0.109823
2   -0.148843
3    0.349051
4   -0.094739
Name: physics, dtype: float64

In [32]:
s.name

'physics'

You can rename a Series with the pandas.Series.rename() method.

In [34]:
s2 = s.rename("maths")
s2.name
# "s" and "s2" refer to different objects

'maths'
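A short sketch of why the name matters and what `rename()` leaves untouched (the data and variable names are made up for illustration): the original Series keeps its name, and the name is what becomes the column label when a Series is turned into a DataFrame.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], name="physics")
renamed = ser.rename("maths")

# rename() returns a new Series; the original keeps its name.
print(ser.name, renamed.name)
print(renamed is ser)

# The name is used as the column label by to_frame().
column_labels = renamed.to_frame().columns.tolist()
print(column_labels)
```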

## DataFrame
DataFrame is a 2-D labeled data structure with columns of potentially different types. Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common-sense rules.

#### From dict of Series or dicts
The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the ordered list of dict keys.

In [39]:
d = {
    "one": pd.Series([1,2,3], index=["a","b","c"]),
    "two": pd.Series([1,2,3,4], index=["a","b","c","d"])
}

df = pd.DataFrame(d)
df

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


In [40]:
pd.DataFrame(d, index=["d","b","a"])

   one  two
d  NaN    4
b  2.0    2
a  1.0    1


In [41]:
pd.DataFrame(d, index=["d","b","a"], columns=["two","three"])

   two three
d    4   NaN
b    2   NaN
a    1   NaN


In [42]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [43]:
df.columns

Index(['one', 'two'], dtype='object')
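The cells above use a dict of Series. For the nested-dict case, a minimal sketch with invented data follows; the inner dict keys act as row labels. The call to `sort_index()` is added here only so that the row order in the sketch does not depend on how pandas orders the union of keys:

```python
import pandas as pd

nested = {
    "one": {"a": 1, "b": 2},
    "two": {"b": 20, "c": 30},
}

# Inner keys become row labels; outer keys become column labels.
frame = pd.DataFrame(nested).sort_index()

print(frame)
print(frame.columns.tolist())
```

Cells with no corresponding inner key, such as row "a" in column "two", are filled with NaN.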

#### From dict of ndarrays/lists
All ndarrays must share the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the index will be "range(n)", where "n" is the array length.

In [45]:
d = {
    "one": [1.0,2.0,3.0,4.0],
    "two": [4.0,3.0,2.0,1.0]
}

pd.DataFrame(d)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


In [46]:
pd.DataFrame(d, index=["a","b","c","d"])

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
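The length requirements can be checked with a small sketch. `try_build` is a hypothetical helper written for this example, not a pandas function; it returns either the DataFrame or the text of the ValueError that pandas raises when the lengths do not agree:

```python
import pandas as pd


def try_build(data, index=None):
    """Return the DataFrame, or the error text if construction fails."""
    try:
        return pd.DataFrame(data, index=index)
    except ValueError as exc:
        return f"ValueError: {exc}"


# Lists of equal length with a matching index: works.
ok = try_build({"one": [1.0, 2.0], "two": [4.0, 3.0]}, index=["a", "b"])

# Lists of different lengths: rejected.
ragged = try_build({"one": [1.0, 2.0, 3.0], "two": [4.0, 3.0]})

# Index longer than the lists: rejected.
bad_index = try_build({"one": [1.0, 2.0], "two": [4.0, 3.0]}, index=["a", "b", "c"])

print(ok)
print(ragged)
print(bad_index)
```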
