### Pandas DataFrame

For this activity, we can continue in the notebook from the previous activity. If you decide to create a new one, don't forget to import the packages.

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects. It is generally the most commonly used Pandas object. Like Series, DataFrame accepts many different kinds of input:

    dictionary of 1D ndarrays, lists, dictionaries, or Series
    2D NumPy ndarray
    Series
    DataFrame
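
The list above includes inputs that the examples in this activity do not walk through. As a minimal sketch (the names `arr`, `from_array`, and `from_frame` are made up for illustration), a 2D ndarray or an existing DataFrame can be passed to the constructor in the same way:

```python
import numpy as np
import pandas as pd

# A 2D ndarray: each row of the array becomes a row of the DataFrame,
# and the column labels are the ones we pass in.
arr = np.arange(6).reshape(3, 2)
from_array = pd.DataFrame(arr, columns=['x', 'y'])

# An existing DataFrame: the constructor returns an equivalent DataFrame.
from_frame = pd.DataFrame(from_array)

print(from_array)
print(from_frame.equals(from_array))
```

If no column labels are given for an ndarray, Pandas falls back to integer labels starting at 0.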

## From Dictionary of Series or Dictionaries

The resulting index will be the union of the indexes of the various Series.

In [1]:
import numpy as np
import pandas as pd

In [2]:
d = { 'one' : pd.Series([1.,2.,3.], index = ['a','b','c']),
      'two' : pd.Series([1.,2.,3.,4.], index = ['a','b','c','d'])}

In [3]:
df = pd.DataFrame(d)

In [4]:
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [6]:
pd.DataFrame(d, index =['d','b','a'])

   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0


In [7]:
pd.DataFrame(d, index = ['d','b','a'], columns = ['two','three'])

   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN


When the data is a dictionary and columns is not passed, the DataFrame columns follow the dictionary’s insertion order. This holds for Python >= 3.6 and Pandas >= 0.23; no sorting is applied.
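
To see the ordering rule in action, here is a small sketch with made-up column names (`zeta`, `alpha`) chosen so that insertion order differs from alphabetical order. It assumes a reasonably recent Python and Pandas:

```python
import pandas as pd

# Keys are inserted in non-alphabetical order on purpose.
data = {'zeta': [1, 2], 'alpha': [3, 4]}

# Without columns=, the dictionary's insertion order is kept.
frame = pd.DataFrame(data)
print(list(frame.columns))

# Passing columns= explicitly overrides that order.
reordered = pd.DataFrame(data, columns=['alpha', 'zeta'])
print(list(reordered.columns))
```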

Warning: If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. A dictionary of Series plus a specific index and/or columns will therefore discard all data that does not match the passed labels. In the last example above, column one and row c are dropped, and column three, which has no source data, is empty.
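
The effect described in the warning can be checked directly. This sketch rebuilds the dictionary of Series from the earlier cells under a hypothetical name (`series_dict`) so it does not disturb the notebook's variables:

```python
import pandas as pd

series_dict = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}

subset = pd.DataFrame(series_dict, index=['d', 'b', 'a'],
                      columns=['two', 'three'])

# 'one' was not requested, so its data is gone.
print('one' in subset.columns)
# Row 'c' was not requested either.
print('c' in subset.index)
# 'three' was requested but has no source data, so it is all missing.
print(subset['three'].isna().all())
```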


## From Dictionary of ndarrays or lists

The ndarrays or lists must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the index will be range(n), where n is the array length.

In [8]:
d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}

In [9]:
pd.DataFrame(d)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


In [10]:
pd.DataFrame(d, index = ['a','b','c','d'])

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
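
The length requirement for a dictionary of lists or ndarrays is enforced by the constructor. The sketch below, using made-up inputs, shows that both a mismatch between the lists and a mismatch between the lists and the index raise a `ValueError`:

```python
import pandas as pd

# Lists of different lengths are rejected.
try:
    pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 3.]})
    unequal_lists_failed = False
except ValueError as exc:
    unequal_lists_failed = True
    print('unequal lists:', exc)

# An index whose length differs from the lists is rejected too.
try:
    pd.DataFrame({'one': [1., 2.]}, index=['a', 'b', 'c'])
    bad_index_failed = False
except ValueError as exc:
    bad_index_failed = True
    print('bad index:', exc)
```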


## From a Series

The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name is provided).

In [11]:
pd.DataFrame(pd.Series(np.random.randn(5), name='something'))

   something
0  -1.701952
1   0.964794
2  -1.524894
3   0.762828
4   0.385856
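
How the column gets its name can be seen with three small cases. This is a sketch with illustrative values; `Series.to_frame(name=...)` is used here as one way to supply a different column name:

```python
import pandas as pd

s = pd.Series([10, 20, 30], name='something')

# The Series name becomes the column name.
named = pd.DataFrame(s)
print(list(named.columns))

# to_frame lets us choose another column name.
renamed = s.to_frame(name='other')
print(list(renamed.columns))

# An unnamed Series produces a column labeled with the integer 0.
unnamed = pd.DataFrame(pd.Series([10, 20, 30]))
print(list(unnamed.columns))
```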


## Column selection, addition, deletion

You can treat a DataFrame semantically like a dictionary of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dictionary operations:

In [12]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [13]:
df['three'] = df['one'] * df['two']

In [14]:
df['flag'] = df['one'] > 2 

In [15]:
df

   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False


Columns can be deleted as with a dictionary:

In [16]:
del df['two']
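
The dictionary analogy extends to a few more operations. The sketch below works on a separate, made-up DataFrame (`demo`) so that the notebook's `df` is left untouched; it shows membership tests, `pop`, and `insert` for placing a column at a chosen position:

```python
import pandas as pd

demo = pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 5., 6.]})

# Membership tests look at the column labels, like dictionary keys.
print('one' in demo)

# pop removes the column and returns it as a Series.
popped = demo.pop('two')
print(popped.tolist())

# insert places a new column at a given position instead of at the end.
demo.insert(0, 'zero', [0., 0., 0.])
print(list(demo.columns))
```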

When inserting a scalar value, it will naturally be propagated to fill the column:

In [17]:
df['foo'] = 'bar'

In [18]:
df

   one  three   flag  foo
a  1.0    1.0  False  bar
b  2.0    4.0  False  bar
c  3.0    9.0   True  bar
d  NaN    NaN  False  bar


When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index.

In [19]:
df['one_trunc'] = df['one'][:2]

In [20]:
df

   one  three   flag  foo  one_trunc
a  1.0    1.0  False  bar        1.0
b  2.0    4.0  False  bar        2.0
c  3.0    9.0   True  bar        NaN
d  NaN    NaN  False  bar        NaN
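
Alignment is by label, not by position. In this sketch (the names `demo` and `other` and their values are made up), the inserted Series has its labels in a different order and includes one label the DataFrame does not have:

```python
import pandas as pd

demo = pd.DataFrame({'one': [1., 2., 3.]}, index=['a', 'b', 'c'])

# Labels are out of order, 'b' is missing, and 'z' is extra.
other = pd.Series([30., 10., 99.], index=['c', 'a', 'z'])

demo['aligned'] = other

# 'a' and 'c' are matched by label, 'b' becomes NaN, 'z' is dropped.
print(demo)
```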


Operations with scalars are just as you would expect:

In [21]:
df = pd.DataFrame(np.random.randn(8, 3), index=range(8), columns=list('ABC'))

In [22]:
df * 5 + 2

          A         B         C
0  3.690502  2.171609 -4.220425
1  5.753734  3.344633  2.665939
2  7.581631  1.614369  4.291563
3 -4.691539 -3.506067 -9.186444
4  3.317372  4.397258  6.445923
5 -1.368738 -2.671318 -4.709640
6  0.362708  0.220374  8.291864
7  0.059089  1.363609  2.241731


In [23]:
1/df

          A          B          C
0  2.957701  29.135990  -0.803804
1  1.332007   3.718487   7.508197
2  0.895795 -12.965770   2.181917
3 -0.747212  -0.908089  -0.446970
4  3.795436   2.085716   1.124626
5 -1.484235  -1.070362  -0.745196
6 -3.053823  -2.809579   0.794677
7 -2.576110  -7.856805  20.684181


In [24]:
df ** 4

          A         B          C
0  0.013067  0.000001   2.395522
1  0.317668  0.005230   0.000315
2  1.552976  0.000035   0.044121
3  3.207923  1.470571  25.054645
4  0.004819  0.052842   0.625126
5  0.206058  0.761865   3.242776
6  0.011498  0.016049   2.507478
7  0.022706  0.000262   0.000005
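
Because the cells above use random numbers, your values will differ from the ones shown. A deterministic sketch with a small, made-up DataFrame makes the elementwise behavior easier to check by hand, including what happens with zero and missing values:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [1., 2.], 'B': [0., np.nan]})

# Each element is multiplied by 5, then 2 is added; NaN stays NaN.
scaled = demo * 5 + 2
print(scaled)

# Division is elementwise too; dividing by 0.0 gives inf rather than an error.
inverse = 1 / demo
print(inverse)
```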


Boolean operators are vectorized as well:

In [25]:
df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)

In [26]:
df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

In [27]:
df1 & df2

       a      b
0  False  False
1  False   True
2   True  False


In [28]:
df1 | df2

      a     b
0  True  True
1  True  True
2  True  True


In [29]:
df1 ^ df2

       a      b
0   True   True
1   True  False
2  False   True


In [30]:
~df1

       a      b
0  False   True
1   True  False
2  False  False
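
A common stumbling block is reaching for Python's `and`, `or`, and `not` keywords instead of the elementwise operators. The sketch below rebuilds the two boolean DataFrames under hypothetical names (`left`, `right`) and shows that the keyword form raises a `ValueError`, while `&` and `~` work element by element:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
right = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

# The keyword form asks for a single truth value, which is ambiguous
# for a whole DataFrame, so Pandas raises an error.
try:
    left and right
    keyword_failed = False
except ValueError as exc:
    keyword_failed = True
    print(exc)

# The operator forms are applied element by element.
both = left & right
negated = ~left
print(both)
print(negated)
```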
