# Pandas DataFrame
**DataFrame** is a 2D labeled data structure with columns of potentially different types. You can think of it as a spreadsheet or SQL table, or as a *dictionary* of **Series** objects. It is generally the most commonly used Pandas object. Like *Series*, *DataFrame* accepts many different kinds of input:

* dictionary of 1D arrays, lists, dictionaries or Series
* 2D NumPy ndarray
* Series
* DataFrame
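
The sections below cover dictionaries and Series, so here is a minimal sketch of the other two input kinds from the list: a 2D NumPy array and an existing DataFrame. The variable names and labels are illustrative only.

```python
import numpy as np
import pandas as pd

# A 2D array becomes a DataFrame with default integer row and column labels.
arr = np.arange(6).reshape(3, 2)
from_array = pd.DataFrame(arr)

# Row and column labels can be supplied explicitly.
labeled = pd.DataFrame(arr, index=['x', 'y', 'z'], columns=['left', 'right'])

# An existing DataFrame is also valid input; here only one column is kept.
subset = pd.DataFrame(labeled, columns=['right'])

print(from_array.shape)          # (3, 2)
print(list(labeled.columns))     # ['left', 'right']
print(subset.loc['y', 'right'])  # 3
```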

In [1]:
import pandas as pd
import numpy as np

## From Dictionary of Series or Dictionaries

The resulting index will be the union of the indexes of the various Series. Labels that a given Series lacks are filled with NaN in its column.

In [2]:
d = {'one': pd.Series([1., 2., 3.], index=['a','b','c',]), 'two': pd.Series([1.,2.,3.,4.], index=['a','b','c','d'])}

In [3]:
df = pd.DataFrame(d)

In [4]:
df

|   | one | two |
|---|-----|-----|
| a | 1.0 | 1.0 |
| b | 2.0 | 2.0 |
| c | 3.0 | 3.0 |
| d | NaN | 4.0 |


In [5]:
pd.DataFrame(d, index=['d','b','a'])

|   | one | two |
|---|-----|-----|
| d | NaN | 4.0 |
| b | 2.0 | 2.0 |
| a | 1.0 | 1.0 |


In [6]:
pd.DataFrame(d, index=['d','b','a'], columns=['two','three'])

|   | two | three |
|---|-----|-------|
| d | 4.0 | NaN   |
| b | 2.0 | NaN   |
| a | 1.0 | NaN   |


When the data is a *dictionary* and *columns* is not passed, the DataFrame columns follow the *dictionary's* insertion order. This holds for Python >= 3.6 and Pandas >= 0.23; on older versions the columns are the lexically sorted dictionary keys.
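
As a small illustration of the ordering rule, the sketch below uses made-up column names that are deliberately not in alphabetical order. It assumes a Python and Pandas version recent enough for insertion order to apply.

```python
import pandas as pd

data = {'zeta': [1, 2], 'alpha': [3, 4], 'mid': [5, 6]}

# Without columns=, the column order follows the dictionary's insertion order.
unsorted = pd.DataFrame(data)
print(list(unsorted.columns))   # ['zeta', 'alpha', 'mid']

# Passing columns= selects and orders the columns explicitly.
ordered = pd.DataFrame(data, columns=['alpha', 'mid', 'zeta'])
print(list(ordered.columns))    # ['alpha', 'mid', 'zeta']
```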

## From Dictionary of *ndarrays* or lists

The **ndarrays** must all be of the same length. If an index is passed, it *must* also be of the same length as the arrays. If no index is passed, the index will be **range(n)**, where n is the array length.

In [7]:
d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}

In [8]:
pd.DataFrame(d)

|   | one | two |
|---|-----|-----|
| 0 | 1.0 | 4.0 |
| 1 | 2.0 | 3.0 |
| 2 | 3.0 | 2.0 |
| 3 | 4.0 | 1.0 |


In [9]:
pd.DataFrame(d, index=['a','b','c','d'])

|   | one | two |
|---|-----|-----|
| a | 1.0 | 4.0 |
| b | 2.0 | 3.0 |
| c | 3.0 | 2.0 |
| d | 4.0 | 1.0 |
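
The length requirements for a dictionary of arrays or lists can be checked with a short sketch. The helper `fails` is hypothetical and exists only for this illustration; it reports whether a constructor call raises `ValueError`.

```python
import pandas as pd

def fails(**kwargs):
    """Hypothetical helper: True if pd.DataFrame(**kwargs) raises ValueError."""
    try:
        pd.DataFrame(**kwargs)
    except ValueError:
        return True
    return False

# Columns of unequal length cannot be combined into one DataFrame.
print(fails(data={'one': [1., 2., 3.], 'two': [4., 3.]}))    # True

# An explicit index must match the column length as well.
print(fails(data={'one': [1., 2., 3.]}, index=['a', 'b']))   # True

# With no index, the rows are labeled 0..n-1.
default = pd.DataFrame({'one': [1., 2., 3.]})
print(list(default.index))                                   # [0, 1, 2]
```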


## From a Series
The result will be a **DataFrame** with the same index as the input *Series*, and with one column whose name is the original name of the *Series* (only if no other column name is provided).

In [10]:
pd.DataFrame(pd.Series(np.random.randn(5), name='something'))

|   | something |
|---|-----------|
| 0 | -0.99886  |
| 1 | -0.95117  |
| 2 | -0.46339  |
| 3 | 0.163716  |
| 4 | -0.659096 |
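
To make the naming rule concrete, here is a minimal sketch with fixed values instead of random ones. Using `Series.to_frame(name=...)` to pick a different label is one option among several, chosen here for brevity.

```python
import pandas as pd

named = pd.Series([10, 20, 30], name='something')
unnamed = pd.Series([10, 20, 30])

# The Series name becomes the column label.
print(list(pd.DataFrame(named).columns))    # ['something']

# An unnamed Series gets the default integer label 0.
print(list(pd.DataFrame(unnamed).columns))  # [0]

# Series.to_frame(name=...) is one way to choose a different label.
renamed = named.to_frame(name='other')
print(list(renamed.columns))                # ['other']
```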


## Column selection, addition, deletion

You can treat a **DataFrame** semantically like a dictionary of like-indexed *Series* objects. Getting, setting, and deleting columns work with the same syntax as the analogous dictionary operations:

In [11]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [12]:
df['three'] = df['one'] * df['two']

In [13]:
df['flag'] = df['one'] > 2

In [14]:
df

|   | one | two | three | flag  |
|---|-----|-----|-------|-------|
| a | 1.0 | 1.0 | 1.0   | False |
| b | 2.0 | 2.0 | 4.0   | False |
| c | 3.0 | 3.0 | 9.0   | True  |
| d | NaN | 4.0 | NaN   | False |


Columns can be deleted as with a dictionary:

In [15]:
del df['two']
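
The dictionary analogy extends to a few other operations. The sketch below uses a separate, made-up frame so that it does not disturb `df`; it shows a membership test and `pop` alongside `del`.

```python
import pandas as pd

frame = pd.DataFrame({'one': [1., 2.], 'two': [3., 4.], 'three': [5., 6.]})

# Membership tests check column labels, as with dictionary keys.
print('two' in frame)        # True

# pop removes a column and returns it as a Series.
popped = frame.pop('two')
print(popped.tolist())       # [3.0, 4.0]

# del removes a column without returning it.
del frame['three']
print(list(frame.columns))   # ['one']
```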

When inserting a scalar value, it will naturally be propagated to fill the whole column:

In [16]:
df['foo'] = 'bar'

In [17]:
df

|   | one | three | flag  | foo |
|---|-----|-------|-------|-----|
| a | 1.0 | 1.0   | False | bar |
| b | 2.0 | 4.0   | False | bar |
| c | 3.0 | 9.0   | True  | bar |
| d | NaN | NaN   | False | bar |


When inserting a **Series** that does not have the same index as the **DataFrame**, it will be conformed to the **DataFrame**'s index:

In [18]:
df['one_trunc'] = df['one'][:2]
df

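|   | one | three | flag  | foo | one_trunc |
|---|-----|-------|-------|-----|-----------|
| a | 1.0 | 1.0   | False | bar | 1.0       |
| b | 2.0 | 4.0   | False | bar | 2.0       |
| c | 3.0 | 9.0   | True  | bar | NaN       |
| d | NaN | NaN   | False | bar | NaN       |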
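
Index conformance works in both directions, which a small sketch with made-up labels can show: labels the Series lacks become NaN, and labels the DataFrame lacks are dropped.

```python
import pandas as pd

frame = pd.DataFrame({'one': [1., 2., 3.]}, index=['a', 'b', 'c'])

# The new Series shares only labels 'b' and 'c' with the frame and
# also carries an extra label 'z'.
other = pd.Series([20., 30., 99.], index=['b', 'c', 'z'])
frame['aligned'] = other

# 'a' has no match and becomes NaN; 'z' is not in the frame and is dropped.
print(frame)
```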


Operations with scalars are just as you would expect:

In [21]:
df = pd.DataFrame(np.random.randn(8, 3), columns=list('ABC'))

In [22]:
df * 5 + 2

|   | A         | B         | C          |
|---|-----------|-----------|------------|
| 0 | 2.238952  | 2.821008  | 7.712536   |
| 1 | -1.41365  | 1.397332  | 3.556755   |
| 2 | 13.690385 | 3.076847  | -14.399979 |
| 3 | -0.675401 | -7.231158 | 6.444319   |
| 4 | -0.611044 | -0.10906  | 4.151468   |
| 5 | 9.911327  | 4.207204  | -1.38102   |
| 6 | -2.06193  | 1.10399   | 7.354967   |
| 7 | 4.121492  | -7.208695 | -2.305344  |


In [23]:
1/df

|   | A         | B         | C         |
|---|-----------|-----------|-----------|
| 0 | 20.924668 | 6.090073  | 0.875268  |
| 1 | -1.464708 | -8.296437 | 3.21181   |
| 2 | 0.427702  | 4.643185  | -0.304878 |
| 3 | -1.868879 | -0.541644 | 1.125032  |
| 4 | -1.914943 | -2.370724 | 2.323995  |
| 5 | 0.632005  | 2.26531   | -1.478844 |
| 6 | -1.230942 | -5.580297 | 0.933713  |
| 7 | 2.356831  | -0.542965 | -1.161347 |


In [24]:
df ** 4

|   | A         | B         | C          |
|---|-----------|-----------|------------|
| 0 | 5e-06     | 0.000727  | 1.703867   |
| 1 | 0.217268  | 0.000211  | 0.009397   |
| 2 | 29.883759 | 0.002151  | 115.742569 |
| 3 | 0.081974  | 11.618356 | 0.624225   |
| 4 | 0.074366  | 0.031657  | 0.034281   |
| 5 | 6.267833  | 0.037974  | 0.209079   |
| 6 | 0.435562  | 0.001031  | 1.31567    |
| 7 | 0.032411  | 11.505681 | 0.549733   |
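
Because the frame above is random, its results cannot be checked by eye. The following sketch repeats the same kinds of scalar operations on a tiny frame with fixed, made-up values, including a zero to show how division behaves.

```python
import pandas as pd

small = pd.DataFrame({'A': [1., 2.], 'B': [4., 0.]})

# Each operation is applied to every element.
scaled = small * 5 + 2   # A: 7.0, 12.0   B: 22.0, 2.0
inverse = 1 / small      # float division by zero gives inf, not an error
power = small ** 2       # A: 1.0, 4.0    B: 16.0, 0.0

print(scaled)
print(inverse)
print(power)
```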


Boolean operators are vectorized as well:

In [25]:
df1 = pd.DataFrame({'a': [1,0,1], 'b':[0,1,1]}, dtype=bool)

In [26]:
df2 = pd.DataFrame({'a': [0,1,1], 'b':[1,1,0]}, dtype=bool)

In [27]:
df1 & df2

|   | a     | b     |
|---|-------|-------|
| 0 | False | False |
| 1 | False | True  |
| 2 | True  | False |


In [28]:
df1 | df2

|   | a    | b    |
|---|------|------|
| 0 | True | True |
| 1 | True | True |
| 2 | True | True |


In [29]:
df1 ^ df2

|   | a     | b     |
|---|-------|-------|
| 0 | True  | True  |
| 1 | True  | False |
| 2 | False | True  |


In [30]:
~df1

|   | a     | b     |
|---|-------|-------|
| 0 | False | True  |
| 1 | True  | False |
| 2 | False | False |
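
A common use of these operators is combining comparison results. The sketch below uses a made-up frame of integers and the `~` operator for logical negation.

```python
import pandas as pd

scores = pd.DataFrame({'a': [1, 5, 3], 'b': [4, 2, 6]})

# Comparisons produce boolean Series that combine with &, | and ~.
# Parentheses are required because these operators bind tighter than
# comparisons.
both = (scores['a'] > 2) & (scores['b'] > 3)
either = (scores['a'] > 2) | (scores['b'] > 3)
neither = ~either

print(both.tolist())     # [False, False, True]
print(either.tolist())   # [True, True, True]
print(neither.tolist())  # [False, False, False]
```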
