# <center><div style="width: 370px;"> ![Panel Data](pictures/Panel_Data.jpg)

# <center> Essential Basic Functionality

In [1]:
import numpy as np
import pandas as pd

Here we discuss a lot of the essential functionality common to the pandas data structures. To begin, let’s create some example objects:

In [2]:
index = pd.date_range("1/1/2000", periods=8)

s = pd.Series(np.random.randn(5), index=list('abcde'))

df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C'])

In [3]:
index

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08'],
              dtype='datetime64[ns]', freq='D')

In [4]:
s

a    0.897547
b   -0.702285
c   -0.587901
d    1.753794
e    1.073189
dtype: float64

In [5]:
df

Unnamed: 0,A,B,C
2000-01-01,0.064296,0.028073,-1.614341
2000-01-02,1.281628,-0.533059,1.903378
2000-01-03,-0.678067,-0.988866,1.106183
2000-01-04,0.075456,-0.633666,-0.071669
2000-01-05,-1.395109,-0.462662,-0.236143
2000-01-06,1.596633,-0.2648,1.330912
2000-01-07,1.742903,1.148552,-0.958059
2000-01-08,0.153352,0.130167,-0.633161


## Head and Tail

To view a small sample of a Series or DataFrame object, use the `head()` and `tail()` methods. The default number of elements to display is five, but you may pass a custom number.

In [6]:
df.head()

Unnamed: 0,A,B,C
2000-01-01,0.064296,0.028073,-1.614341
2000-01-02,1.281628,-0.533059,1.903378
2000-01-03,-0.678067,-0.988866,1.106183
2000-01-04,0.075456,-0.633666,-0.071669
2000-01-05,-1.395109,-0.462662,-0.236143


In [7]:
df.tail(2)

Unnamed: 0,A,B,C
2000-01-07,1.742903,1.148552,-0.958059
2000-01-08,0.153352,0.130167,-0.633161


## Attributes and underlying data

pandas objects have a number of attributes that give you access to their metadata:
- `shape`: gives the axis dimensions of the object, consistent with ndarray
- Axis labels
     - Series: `index` (only axis)
     - DataFrame: `index` (rows) and `columns`

> **Note:** the axis-label attributes (`index` and `columns`) can be safely assigned to!

In [8]:
df[:3]

Unnamed: 0,A,B,C
2000-01-01,0.064296,0.028073,-1.614341
2000-01-02,1.281628,-0.533059,1.903378
2000-01-03,-0.678067,-0.988866,1.106183


In [9]:
df.columns = [x.lower() for x in df.columns]

In [10]:
df.head(4)

Unnamed: 0,a,b,c
2000-01-01,0.064296,0.028073,-1.614341
2000-01-02,1.281628,-0.533059,1.903378
2000-01-03,-0.678067,-0.988866,1.106183
2000-01-04,0.075456,-0.633666,-0.071669
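
As a small illustration of the attributes listed above, the sketch below inspects `shape`, `index` and `columns` on a throwaway frame and then reassigns the column labels. The name `demo` and its values are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=pd.date_range("2000-01-01", periods=3),
    columns=["A", "B"],
)

print(demo.shape)    # (rows, columns) tuple, as with an ndarray
print(demo.index)    # row labels
print(demo.columns)  # column labels

# Axis labels can be replaced by plain assignment
demo.columns = ["left", "right"]
print(list(demo.columns))
```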


pandas objects (`Index`, `Series`, `DataFrame`) can be thought of as containers for arrays, which hold the actual data and do the actual computation. For many types, the underlying array is a `numpy.ndarray`. However, pandas and third-party libraries may extend NumPy’s type system to add support for custom arrays (see dtypes).

To get the actual data inside an `Index` or `Series`, use the `.array` property:

In [11]:
s.array

<NumpyExtensionArray>
[ 0.8975466822115005, -0.7022845280972562, -0.5879008747809251,
  1.7537937494780629,  1.0731893198383422]
Length: 5, dtype: float64

In [12]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [13]:
s.index.array

<NumpyExtensionArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

`array` will always be an `ExtensionArray`. The exact details of what an ExtensionArray is and why pandas uses them are a bit beyond the scope of this introduction. See `dtypes` for more.
If you know you need a NumPy array, use `to_numpy()` or `numpy.asarray()`.

In [14]:
s.to_numpy()

array([ 0.89754668, -0.70228453, -0.58790087,  1.75379375,  1.07318932])

In [15]:
np.asarray(s)

array([ 0.89754668, -0.70228453, -0.58790087,  1.75379375,  1.07318932])
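
The difference between `.array` and `to_numpy()` can be checked directly. This is a minimal sketch with made-up values; the names `ser`, `ext` and `raw` are illustrative only.

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.5, -1.0, 2.0], index=list("abc"))  # made-up values

ext = ser.array       # pandas-level array wrapper around the data
raw = ser.to_numpy()  # plain NumPy array

# .array yields an ExtensionArray subclass, to_numpy() an ndarray
is_extension = isinstance(ext, pd.api.extensions.ExtensionArray)
is_ndarray = isinstance(raw, np.ndarray)
print(type(ext).__name__, is_extension)
print(type(raw).__name__, is_ndarray)
```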

## Flexible binary operations

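With binary operations between pandas data structures, there are two key points of interest:

- Broadcasting behavior between higher-dimensional (e.g. DataFrame) and lower-dimensional (e.g. Series) objects.
- Missing data in computations.

We will demonstrate how to manage these issues independently, though they can be handled simultaneously.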

### Matching / broadcasting behavior

DataFrame has the methods `add()`, `sub()`, `mul()`, `div()` and related functions `radd()`, `rsub()`, ... for carrying out binary operations. For broadcasting behavior, Series input is of primary interest. With these methods, the `axis` keyword selects whether the Series is matched on the index or on the columns:


In [20]:
df = pd.DataFrame({
    "one": pd.Series(np.random.randn(3), index=list('abc')),
    "two": pd.Series(np.random.randn(4), index=list('abcd')),
    "three": pd.Series(np.random.randn(3), index=list('bcd')),
})

df

Unnamed: 0,one,two,three
a,-0.978203,-0.666148,
b,-2.019387,0.239563,-0.67283
c,0.852647,-2.432966,0.995024
d,,-0.716317,2.400598


In [21]:
row = df.iloc[1]

In [22]:
row

one     -2.019387
two      0.239563
three   -0.672830
Name: b, dtype: float64

In [26]:
column = df["two"]

In [27]:
column

a   -0.666148
b    0.239563
c   -2.432966
d   -0.716317
Name: two, dtype: float64

In [28]:
df.sub(row, axis=1)

Unnamed: 0,one,two,three
a,1.041183,-0.90571,
b,0.0,0.0,0.0
c,2.872034,-2.672529,1.667854
d,,-0.95588,3.073428


In [29]:
df.sub(column, axis=0)

Unnamed: 0,one,two,three
a,-0.312056,0.0,
b,-2.258949,0.0,-0.912393
c,3.285613,0.0,3.42799
d,,0.0,3.116915
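
Because the random values above make the arithmetic hard to follow by eye, here is a small sketch of the same two calls on made-up round numbers. The names `frame`, `row` and `col` are illustrative. The last line relies on the `-` operator aligning a Series against the columns, which is what `axis=1` does.

```python
import pandas as pd

frame = pd.DataFrame(
    {"one": [1.0, 2.0], "two": [10.0, 20.0]},
    index=["a", "b"],
)
row = frame.loc["a"]  # Series labeled by column names
col = frame["two"]    # Series labeled by row labels

# axis=1: match the Series labels against the columns
by_columns = frame.sub(row, axis=1)

# axis=0: match the Series labels against the index
by_index = frame.sub(col, axis=0)

print(by_columns)
print(by_index)

# The plain operator aligns a Series on the columns as well
same_as_operator = by_columns.equals(frame - row)
```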


Furthermore, you can align a level of a MultiIndexed DataFrame with a Series.


In [31]:
df_co = df.copy()
df_co.index = pd.MultiIndex.from_tuples(
    [(1, "a"), (1, "b"), (1, "c"), (2, "d")], names=["first", "second"]
)

In [32]:
df_co

first,second,one,two,three
1,a,-0.978203,-0.666148,
1,b,-2.019387,0.239563,-0.67283
1,c,0.852647,-2.432966,0.995024
2,d,,-0.716317,2.400598
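
The cells above only build the MultiIndexed frame. A minimal sketch of the level-aligned operation itself, using the `level` keyword of `sub()`, might look like the following. The data and the names `base`, `column` and `multi` are made up for illustration.

```python
import numpy as np
import pandas as pd

base = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, np.nan], "two": [10.0, 20.0, 30.0, 40.0]},
    index=list("abcd"),
)
column = base["two"]  # Series indexed by 'a'..'d'

multi = base.copy()
multi.index = pd.MultiIndex.from_tuples(
    [(1, "a"), (1, "b"), (1, "c"), (2, "d")], names=["first", "second"]
)

# Match the Series labels against the "second" level of the row index
result = multi.sub(column, axis=0, level="second")
print(result)
```

Each row has the value of `column` that shares its `second` label subtracted from it, so the `two` column becomes zero.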


### Missing data / operations with fill values

In Series and DataFrame, the arithmetic functions accept a `fill_value` argument: a value to substitute when at most one of the values at a location is missing. For example, when adding two DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will be NaN (you can later replace NaN with some other value using `fillna` if you wish).

In [87]:
df1 = pd.DataFrame({
    "one": pd.Series(np.random.randn(3), index=list('abc')),
    "two": pd.Series(np.random.randn(4), index=list('abcd')),
    "three": pd.Series(np.random.randn(3), index=list('bcd')),
})

In [88]:
df1

Unnamed: 0,one,two,three
a,0.158138,0.424273,
b,-0.357927,-0.197747,-0.0567
c,-0.42606,-0.340604,0.094217
d,,-0.161206,-0.110049


In [89]:
df2 = pd.DataFrame({
    "one": pd.Series(np.random.randn(3), index=list('abc')),
    "two": pd.Series(np.random.randn(4), index=list('abcd')),
    "three": pd.Series(np.random.randn(4), index=list('abcd')),
})

In [90]:
df2

Unnamed: 0,one,two,three
a,-0.205188,0.144582,0.226431
b,0.685371,1.365321,0.991827
c,1.461,-1.054371,0.662248
d,,-1.274356,-1.48579


In [91]:
df1 + df2

Unnamed: 0,one,two,three
a,-0.047049,0.568855,
b,0.327443,1.167574,0.935127
c,1.03494,-1.394975,0.756465
d,,-1.435562,-1.595839


In [92]:
df1.add(df2, fill_value=0)

Unnamed: 0,one,two,three
a,-0.047049,0.568855,0.226431
b,0.327443,1.167574,0.935127
c,1.03494,-1.394975,0.756465
d,,-1.435562,-1.595839
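
The three possible cases (neither value missing, one missing, both missing) are easier to see on a tiny frame. This sketch uses made-up values and the illustrative names `left` and `right`.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"x": [10.0, 20.0, np.nan]})

# Plain addition: any NaN operand gives NaN
plain = left + right

# fill_value=0: a single missing operand is treated as 0,
# but a location missing on both sides stays NaN
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```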


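## Boolean reductions

The reductions `any()` and `all()`, together with the `empty` property, summarize a boolean result.
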
In [93]:
(df1 > 0).all()

one      False
two      False
three    False
dtype: bool

In [97]:
(df1 > 0).any()

one      True
two      True
three    True
dtype: bool

You can reduce to a final boolean value.

In [96]:
(df1 > 0).any().any()

True

You can test whether a pandas object is empty via the `empty` property.


In [99]:
df.empty

False

In [100]:
test = pd.DataFrame(columns=list("ABC"))

In [101]:
test

Unnamed: 0,A,B,C


In [102]:
test.empty

True

In [104]:
pd.DataFrame()

In [105]:
pd.DataFrame().empty

True

**Warning:** You might be tempted to write the following, but using a DataFrame in a boolean context raises an error:

In [106]:
if df:
    pass

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [107]:
df1 and df2

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [109]:
df1.any().any() and df2.any().any()

True

In [110]:
df1.all().all() and df2.all().all()

True
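
To make the warning concrete, the sketch below catches the error raised by `bool()` and then asks the explicit questions instead. The frame `frame` and its values are made up for this example.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 0], "b": [0, 0]})  # made-up values

# Implicit truth testing is rejected as ambiguous
try:
    bool(frame)
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")

# Explicit alternatives state which question is being asked
is_empty = frame.empty        # does the frame have no elements?
has_any = frame.any().any()   # is at least one element truthy?
has_all = frame.all().all()   # is every element truthy?
print(is_empty, has_any, has_all)
```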

## Comparing if objects are equivalent

Often you may find that there is more than one way to compute the same result. As a simple example, consider `df + df` and `df * 2`. To test that these two computations produce the same result, given the tools shown above, you might imagine using `(df + df == df * 2).all()`. But in fact, this expression does not return True for every column:

In [111]:
df + df == df * 2

Unnamed: 0,one,two,three
a,True,True,False
b,True,True,True
c,True,True,True
d,False,True,True


In [112]:
(df + df == df * 2).all()

one      False
two       True
three    False
dtype: bool

In [113]:
(df + df == df * 2).any()

one      True
two      True
three    True
dtype: bool

Notice that the boolean DataFrame `df + df == df * 2` contains some False values! This is because NaNs do not compare as equals:

In [114]:
np.nan == np.nan

False

So, NDFrames (such as Series and DataFrames) have an `equals()` method for testing equality, with NaNs in corresponding locations treated as equal.

In [115]:
(df + df).equals(df * 2)


True
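
One detail worth keeping in mind is that `equals()` also compares the index, so the same labeled values in a different row order do not count as equal. A minimal sketch with made-up data (the names `frame_a` and `frame_b` are illustrative):

```python
import numpy as np
import pandas as pd

frame_a = pd.DataFrame({"col": [1.0, 0.0, np.nan]})
# Same labeled values, but the rows are stored in reverse order
frame_b = pd.DataFrame({"col": [np.nan, 0.0, 1.0]}, index=[2, 1, 0])

equal_unsorted = frame_a.equals(frame_b)
equal_sorted = frame_a.equals(frame_b.sort_index())
print(equal_unsorted, equal_sorted)
```

Sorting the index first makes the row order match, and the NaN in the last row is then treated as equal to its counterpart.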

You can conveniently perform element-wise comparisons when comparing a pandas data structure with a scalar value:


In [116]:
pd.Series(["foo", "bar", "baz"]) == "foo"

0     True
1    False
2    False
dtype: bool

In [118]:
pd.Index(["foo", "bar", "baz"]) == "foo"

array([ True, False, False])

pandas also handles element-wise comparisons between different array-like objects of the same length:


In [124]:
pd.Series(["foo", "bar", "baz"]) == pd.Index(["foo", "bar", "qux"])

0     True
1     True
2    False
dtype: bool

In [125]:
np.array(["foo", "bar", "baz"]) == pd.Series(["foo", "bar", "qux"])

0     True
1     True
2    False
dtype: bool

Trying to compare `Index` or `Series` objects of different lengths will raise a ValueError:

In [126]:
try:
    pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
except ValueError as e:
    print(f'ValueError: {e}')

ValueError: Can only compare identically-labeled Series objects


> ***Note*** that this is different from the NumPy behavior where a comparison can be broadcast:

In [128]:
np.array([1, 2, 3]) == np.array([2])

array([False,  True, False])

or raise a ValueError if broadcasting cannot be done (older NumPy versions returned False instead):

In [129]:
np.array([1, 2, 3]) == np.array([1, 2])

ValueError: operands could not be broadcast together with shapes (3,) (2,) 

## Combining overlapping data sets


A problem occasionally arising is the combination of two similar data sets where values in one are preferred over the other. An example would be two data series representing a particular economic indicator where one is considered to be of “higher quality”. However, the lower quality series might extend further back in history or have more complete data coverage. As such, we would like to combine two DataFrame objects where missing values in one DataFrame are conditionally filled with like-labeled values from the other DataFrame. The function implementing this operation is `combine_first()`, which we illustrate:

In [131]:
df1 = pd.DataFrame(
    {"A": [1, np.nan, 3, 5, np.nan],
     "B": [np.nan, 2, 3, np.nan, 6]}
)

In [135]:
df2 = pd.DataFrame(
    {"A": [5, 2, 4, np.nan, 3, 7],
     "B": [np.nan, np.nan, 3, 4, 6, 8]}
)

In [136]:
df1

Unnamed: 0,A,B
0,1.0,
1,,2.0
2,3.0,3.0
3,5.0,
4,,6.0


In [137]:
df2

Unnamed: 0,A,B
0,5.0,
1,2.0,
2,4.0,3.0
3,,4.0
4,3.0,6.0
5,7.0,8.0


In [138]:
df1.combine_first(df2)

Unnamed: 0,A,B
0,1.0,
1,2.0,2.0
2,3.0,3.0
3,5.0,4.0
4,3.0,6.0
5,7.0,8.0
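
The economic-indicator scenario described at the start of this section can be sketched with two Series, since `Series` also provides `combine_first()`. The year labels, values and the names `high_quality` and `long_history` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Preferred source: shorter history, with one gap
high_quality = pd.Series([1.0, np.nan, 3.0], index=[2001, 2002, 2003])

# Fallback source: longer history, complete coverage
long_history = pd.Series([0.5, 0.9, 2.1, 2.9], index=[2000, 2001, 2002, 2003])

# Keep high_quality where it has a value, otherwise take long_history
combined = high_quality.combine_first(long_history)
print(combined)
```

The result covers the union of both indexes: 2000 and 2002 come from the fallback series, 2001 and 2003 from the preferred one.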


## General DataFrame combine


The `combine_first()` method above calls the more general `DataFrame.combine()`. This method takes another DataFrame and a combiner function, aligns the input DataFrame and then passes the combiner function pairs of Series (i.e., columns whose names are the same).

So, for instance, to reproduce `combine_first()` as above:

In [139]:
def combiner(x, y):
    return np.where(pd.isna(x), y, x)

In [140]:
df1.combine(df2, combiner)


Unnamed: 0,A,B
0,1.0,
1,2.0,2.0
2,3.0,3.0
3,5.0,4.0
4,3.0,6.0
5,7.0,8.0
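
Any function that accepts two Series and returns an array-like of the same length can act as the combiner. As a further sketch, assuming two already aligned frames with made-up integer values, `np.minimum` keeps the smaller value at each position. The names `left`, `right` and `smallest` are illustrative.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"A": [1, 5], "B": [4, 2]})
right = pd.DataFrame({"A": [3, 3], "B": [3, 3]})

# The combiner is called once per shared column with a pair of Series
smallest = left.combine(right, np.minimum)
print(smallest)
```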
