# Creating Data Frames

A DataFrame is a 2-dimensional labeled data structure whose columns can hold different types. You can think of it
as a spreadsheet, a SQL table, or a dict of Series objects.

You can create a DataFrame from:
- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame
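
The cells in this section work through the dictionary-based inputs. As a minimal sketch of the other input types in the list above (the variable names and values are illustrative and do not come from the original notebook):

```python
import numpy as np
import pandas as pd

# 2-D ndarray: rows and columns get default integer labels unless supplied
array_df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['x', 'y', 'z'])

# Structured ndarray: the field names become the column names
records = np.array([(1, 2.5), (2, 3.5)], dtype=[('id', 'i4'), ('score', 'f8')])
records_df = pd.DataFrame(records)

# Series: becomes a single column named after the Series
from_series_df = pd.DataFrame(pd.Series([10, 20, 30], name='total'))

# Another DataFrame: copy=True asks for an independent copy of the data
copy_df = pd.DataFrame(array_df, copy=True)

print(array_df.shape)
print(list(records_df.columns))
print(list(from_series_df.columns))
```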

### Data Frame attributes
| Attribute | Description                                                                                                       |
|-----------|-------------------------------------------------------------------------------------------------------------------|
| T         | Transpose index and columns.                                                                                      |
| at        | Fast label-based scalar accessor.                                                                                 |
| axes      | Return a list with the row axis labels and column axis labels as the only members.                                |
| blocks    | Internal property, property synonym for as_blocks().                                                              |
| dtypes    | Return the dtypes in this object.                                                                                 |
| empty     | True if NDFrame is entirely empty [no items], meaning any of the axes are of length 0.                            |
| ftypes    | Return the ftypes (indication of sparse/dense and dtype) in this object.                                          |
| iat       | Fast integer location scalar accessor.                                                                            |
| iloc      | Purely integer-location based indexing for selection by position.                                                 |
| is_copy   |                                                                                                                   |
| ix        | A primarily label-location based indexer, with integer position fallback.                                         |
| loc       | Purely label-location based indexer for selection by label.                                                       |
| ndim      | Number of axes / array dimensions.                                                                                |
| shape     | Return a tuple representing the dimensionality of the DataFrame.                                                  |
| size      | Number of elements in the NDFrame.                                                                                |
| style     | Property returning a Styler object containing methods for building a styled HTML representation of the DataFrame. |
| values    | Numpy representation of NDFrame.                                                                                  |
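
A short sketch of a few of these attributes on a small, made-up frame (`attr_df` and its values are assumptions for illustration). It sticks to attributes that exist across pandas versions; `ix`, `blocks`, `ftypes` and `is_copy` were removed in later pandas releases, so they are left out here.

```python
import pandas as pd

# Small illustrative frame (hypothetical data)
attr_df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.5, 1.5, 2.5]},
                       index=['x', 'y', 'z'])

print(attr_df.shape)    # tuple of (rows, columns)
print(attr_df.ndim)     # number of axes
print(attr_df.size)     # rows * columns
print(attr_df.dtypes)   # one dtype per column
print(attr_df.empty)    # True only if some axis has length 0
print(attr_df.axes)     # [row index, column index]
print(attr_df.T.shape)  # shape after transposing

# Scalar accessors: label-based (at) and position-based (iat)
by_label = attr_df.at['y', 'b']
by_position = attr_df.iat[1, 1]
print(by_label, by_position)
```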

In [1]:
import pandas as pd
import numpy as np

### Creating data frames from various data types
##### create data frame from Python dictionary

In [2]:
my_dictionary = {'a' : 45., 'b' : -19.5, 'c' : 4444}
print(my_dictionary.keys())
print(my_dictionary.values())


dict_keys(['a', 'b', 'c'])
dict_values([45.0, -19.5, 4444])


In [3]:
my_dictionary_df = pd.DataFrame(my_dictionary, index=['first', 'again'])
my_dictionary_df

Unnamed: 0,a,b,c
first,45.0,-19.5,4444
again,45.0,-19.5,4444


##### constructor without explicit index

In [4]:
cookbook_df = pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]})
cookbook_df

Unnamed: 0,AAA,BBB,CCC
0,4,10,100
1,5,20,50
2,6,30,-30
3,7,40,-50


##### constructor contains dictionary with Series as values

In [5]:
series_dict = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
               'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
series_df = pd.DataFrame(series_dict)
series_df

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


##### dictionary of lists

In [6]:
produce_dict = {'veggies': ['potatoes', 'onions', 'peppers', 'carrots'],
                'fruits': ['apples', 'bananas', 'pineapple', 'berries']}
produce_dict

{'fruits': ['apples', 'bananas', 'pineapple', 'berries'],
 'veggies': ['potatoes', 'onions', 'peppers', 'carrots']}

In [7]:
pd.DataFrame(produce_dict)

Unnamed: 0,fruits,veggies
0,apples,potatoes
1,bananas,onions
2,pineapple,peppers
3,berries,carrots


##### list of dictionaries

In [8]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data2)

Unnamed: 0,a,b,c
0,1,2,
1,5,10,20.0


##### dictionary of tuples, with MultiIndex

In [9]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

Unnamed: 0_level_0,Unnamed: 1_level_0,a,a,a,b,b
Unnamed: 0_level_1,Unnamed: 1_level_1,a,b,c,a,b
A,B,4.0,1.0,5.0,8.0,10.0
A,C,3.0,2.0,6.0,7.0,
A,D,,,,,9.0


# Select, Add, Delete, Columns

In [10]:
import pandas as pd
import numpy as np

### dictionary like operations
##### dictionary selection with string index

In [11]:
cookbook_df = pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]})
cookbook_df['BBB']

0    10
1    20
2    30
3    40
Name: BBB, dtype: int64

##### arithmetic vectorized operation using string indices

In [12]:
cookbook_df['BBB'] * cookbook_df['CCC']

0    1000
1    1000
2    -900
3   -2000
dtype: int64

##### column deletion 

In [14]:
del cookbook_df['BBB']
cookbook_df

Unnamed: 0,AAA,CCC
0,4,100
1,5,50
2,6,-30
3,7,-50


In [15]:
last_column = cookbook_df.pop('CCC')
last_column

0    100
1     50
2    -30
3    -50
Name: CCC, dtype: int64

In [16]:
cookbook_df

Unnamed: 0,AAA
0,4
1,5
2,6
3,7


##### add a new column using a Python list

In [17]:
cookbook_df['DDD'] = [32, 21, 43, 'hike']
cookbook_df

Unnamed: 0,AAA,DDD
0,4,32
1,5,21
2,6,43
3,7,hike


##### insert function

In [18]:
cookbook_df.insert(1, "new column", [3,4,5,6])
cookbook_df

Unnamed: 0,AAA,new column,DDD
0,4,3,32
1,5,4,21
2,6,5,43
3,7,6,hike


# Indexing and Selection

| Operation                     | Syntax         | Result    |
|-------------------------------|----------------|-----------|
| Select column                 | df[col]        | Series    |
| Select row by label           | df.loc[label]  | Series    |
| Select row by integer         | df.iloc[loc]   | Series    |
| Select rows                   | df[start:stop] | DataFrame |
| Select rows with boolean mask | df[mask]       | DataFrame |
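
To make the table concrete, here is a small sketch that runs each operation once and prints the type of the result. The frame `table_df` and its labels are made up for this illustration.

```python
import pandas as pd

# Illustrative frame with string row labels (hypothetical data)
table_df = pd.DataFrame({'AAA': [4, 5, 6, 7], 'BBB': [10, 20, 30, 40]},
                        index=['w', 'x', 'y', 'z'])

column = table_df['AAA']                   # select column -> Series
row_by_label = table_df.loc['x']           # select row by label -> Series
row_by_position = table_df.iloc[1]         # select row by integer -> Series
row_slice = table_df[1:3]                  # select rows by position -> DataFrame
masked = table_df[table_df['BBB'] > 20]    # boolean mask -> DataFrame

for result in (column, row_by_label, row_by_position, row_slice, masked):
    print(type(result).__name__)
```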

In [19]:
import pandas as pd
import numpy as np

In [20]:
produce_dict = {'veggies': ['potatoes', 'onions', 'peppers', 'carrots'],'fruits': ['apples', 'bananas', 'pineapple', 'berries']}
produce_df = pd.DataFrame(produce_dict)
produce_df

Unnamed: 0,fruits,veggies
0,apples,potatoes
1,bananas,onions
2,pineapple,peppers
3,berries,carrots


##### selection using dictionary-like string

In [21]:
produce_df['fruits']

0       apples
1      bananas
2    pineapple
3      berries
Name: fruits, dtype: object

##### list of strings as index (note: double square brackets)

In [22]:
produce_df[ ['fruits', 'veggies'] ]

Unnamed: 0,fruits,veggies
0,apples,potatoes
1,bananas,onions
2,pineapple,peppers
3,berries,carrots


##### select row using integer index

In [23]:
produce_df.iloc[2]

fruits     pineapple
veggies      peppers
Name: 2, dtype: object

##### select rows using integer slice

In [24]:
produce_df.iloc[0:2]

Unnamed: 0,fruits,veggies
0,apples,potatoes
1,bananas,onions


In [25]:
produce_df.iloc[:-2]

Unnamed: 0,fruits,veggies
0,apples,potatoes
1,bananas,onions


##### + on string columns concatenates element-wise (the selected row is broadcast across all rows)

In [26]:
produce_df + produce_df.iloc[0]

Unnamed: 0,fruits,veggies
0,applesapples,potatoespotatoes
1,bananasapples,onionspotatoes
2,pineappleapples,pepperspotatoes
3,berriesapples,carrotspotatoes


### Data alignment and arithmetic
Arithmetic between DataFrame objects automatically aligns on both the columns and the index (row labels).

Note where 'NaN' appears in the result: in every position whose row or column label is missing from one of the operands.
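
Because the notebook cell uses random numbers, a small deterministic sketch may make the alignment rule easier to see. The frames `left` and `right` are assumptions for illustration; `DataFrame.add` with `fill_value` is shown as one way to avoid NaN when a label exists on only one side.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
right = pd.DataFrame({'A': [100, 200]})  # fewer rows, and no column 'B'

# Labels are matched first; positions without a partner become NaN
aligned_sum = left + right
print(aligned_sum)

# fill_value=0 treats a value missing from one operand as 0
filled_sum = left.add(right, fill_value=0)
print(filled_sum)
```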

In [27]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
sum_df = df + df2
sum_df

Unnamed: 0,A,B,C,D
0,1.098236,-0.374752,-1.000263,
1,-1.433531,-0.969911,0.096911,
2,-2.364889,1.692996,-1.510949,
3,1.143989,0.954158,-0.786073,
4,-0.715075,1.017342,0.112424,
5,-0.920792,0.184539,2.247119,
6,0.367535,0.435628,-1.312279,
7,,,,
8,,,,
9,,,,


### Boolean indexing

In [28]:
sum_df>0

Unnamed: 0,A,B,C,D
0,True,False,False,False
1,False,False,True,False
2,False,True,False,False
3,True,True,False,False
4,False,True,True,False
5,False,True,True,False
6,True,True,False,False
7,False,False,False,False
8,False,False,False,False
9,False,False,False,False


In [29]:
sum_df[sum_df>0]

Unnamed: 0,A,B,C,D
0,1.098236,,,
1,,,0.096911,
2,,1.692996,,
3,1.143989,0.954158,,
4,,1.017342,0.112424,
5,,0.184539,2.247119,
6,0.367535,0.435628,,
7,,,,
8,,,,
9,,,,


First, build a boolean mask that is True for the rows whose value in column B is less than zero.

Then, use the mask to select those rows, keeping all of their columns in the result.

In [30]:
mask = sum_df['B'] < 0
mask

0     True
1     True
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: B, dtype: bool

In [31]:
sum_df[mask]

Unnamed: 0,A,B,C,D
0,1.098236,-0.374752,-1.000263,
1,-1.433531,-0.969911,0.096911,


##### isin function

In [32]:
produce_df.isin(['apples', 'onions'])

Unnamed: 0,fruits,veggies
0,True,False
1,False,True
2,False,False
3,False,False


##### where function

In [33]:
produce_df.where(produce_df > 'k')

Unnamed: 0,fruits,veggies
0,,potatoes
1,,onions
2,pineapple,peppers
3,,


# NumPy Universal Functions

If the data within a DataFrame are numeric, NumPy's universal functions can be used on/with the DataFrame.
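
A minimal sketch with fixed values, assuming a made-up frame `numeric_df`: the universal function is applied element by element, and the result is still a DataFrame with the original row and column labels.

```python
import numpy as np
import pandas as pd

# Illustrative numeric frame containing one missing value
numeric_df = pd.DataFrame({'A': [1.0, 4.0, np.nan], 'B': [9.0, 16.0, 25.0]},
                          index=['r1', 'r2', 'r3'])

roots = np.sqrt(numeric_df)   # element-wise square root
print(type(roots).__name__)   # the labels and container type are preserved
print(roots)
```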

In [34]:
import pandas as pd
import numpy as np

In [35]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
sum_df = df + df2
sum_df

Unnamed: 0,A,B,C,D
0,0.764238,-1.948982,-0.347591,
1,0.417359,0.148641,-0.396326,
2,-0.895075,0.951687,1.301042,
3,-0.390659,-0.752704,-0.24334,
4,-1.603647,1.821259,0.756104,
5,3.913272,1.199506,0.092893,
6,0.23418,-3.603053,0.286862,
7,,,,
8,,,,
9,,,,


##### NaN values pass through universal functions as NaN

In [36]:
np.exp(sum_df)

Unnamed: 0,A,B,C,D
0,2.147357,0.142419,0.706388,
1,1.517948,1.160256,0.672787,
2,0.408577,2.590075,3.673123,
3,0.676611,0.471091,0.784005,
4,0.201162,6.179634,2.129962,
5,50.062506,3.318478,1.097344,
6,1.263872,0.02724,1.332241,
7,,,,
8,,,,
9,,,,


##### Transpose is available through the T attribute

In [37]:
sum_df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
A,0.764238,0.417359,-0.895075,-0.390659,-1.603647,3.913272,0.23418,,,
B,-1.948982,0.148641,0.951687,-0.752704,1.821259,1.199506,-3.603053,,,
C,-0.347591,-0.396326,1.301042,-0.24334,0.756104,0.092893,0.286862,,,
D,,,,,,,,,,


In [38]:
np.transpose(sum_df.values)

array([[ 0.76423795,  0.41735915, -0.89507458, -0.39065921, -1.60364688,
         3.91327234,  0.23417986,         nan,         nan,         nan],
       [-1.94898211,  0.14864087,  0.951687  , -0.75270443,  1.82125898,
         1.1995061 , -3.60305264,         nan,         nan,         nan],
       [-0.34759103, -0.39632645,  1.30104219, -0.24333956,  0.75610399,
         0.0928929 ,  0.28686213,         nan,         nan,         nan],
       [        nan,         nan,         nan,         nan,         nan,
                nan,         nan,         nan,         nan,         nan]])

##### dot method on DataFrame implements matrix multiplication
Note: the result takes its row labels from the left operand and its column labels from the right operand.
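
With default integer labels this is hard to see, so here is a sketch using named labels (`left_df`, `right_df` and the label names are assumptions for illustration). The column labels of the left frame have to match the row labels of the right frame.

```python
import numpy as np
import pandas as pd

left_df = pd.DataFrame(np.arange(6).reshape(2, 3),
                       index=['r0', 'r1'], columns=['x', 'y', 'z'])
right_df = pd.DataFrame(np.arange(6).reshape(3, 2),
                        index=['x', 'y', 'z'], columns=['p', 'q'])

# Rows come from left_df, columns from right_df
product = left_df.dot(right_df)
print(product)

# The numbers match plain NumPy matrix multiplication
print(np.array_equal(product.values, left_df.values @ right_df.values))
```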

In [39]:
A_df = pd.DataFrame(np.arange(15).reshape((3,5)))
B_df = pd.DataFrame(np.arange(10).reshape((5,2)))
A_df.dot(B_df)

Unnamed: 0,0,1
0,60,70
1,160,195
2,260,320


##### dot method on Series implements dot product

In [40]:
C_Series = pd.Series(np.arange(5,10))
C_Series.dot(C_Series)

255

# Creating Panels
A Panel is a three-dimensional analogue of DataFrame. Each item (the analogue of a column in a DataFrame) in a Panel is a DataFrame.

The term panel data is derived from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s. The names of the 3 axes within a panel are intended to give some semantic meaning to operations involving panel data and, in particular, econometric analysis of panel data. However, for the strict purposes of slicing and dicing a collection of DataFrame objects, you may find the axis names slightly arbitrary:

- items: axis 0, each item corresponds to a DataFrame contained inside
- major_axis: axis 1, it is the index (rows) of each of the DataFrames
- minor_axis: axis 2, it is the columns of each of the DataFrames

In [41]:
import pandas as pd
import numpy as np
import datetime
from pandas_datareader import data, wb

pd.set_eng_float_format(accuracy=2, use_eng_prefix=True)

In [42]:
my_first_panel = pd.Panel(np.random.randn(2, 5, 4), 
                          items=['Item01', 'Item02'],
                          major_axis=pd.date_range('9/6/2016', periods=5),
                          minor_axis=['A', 'B', 'C', 'D'])
my_first_panel

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item01 to Item02
Major_axis axis: 2016-09-06 00:00:00 to 2016-09-10 00:00:00
Minor_axis axis: A to D

### From dict of DataFrame objects
Note that the values in the dict need only be convertible to DataFrame.

In [43]:
dictionary_of_data_frames = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
                             'Item2' : pd.DataFrame(np.random.randn(4, 2))}
my_dictionary_panel = pd.Panel(dictionary_of_data_frames)
my_dictionary_panel

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

##### Panel.from_dict()
One helpful factory method is Panel.from_dict, which takes a dictionary of DataFrames, and has the following named parameters:

| Parameter | Default | Description                                         |
|-----------|---------|-----------------------------------------------------|
| intersect | False   | drops elements whose indices do not align           |
| orient    | items   | use minor to use DataFrames' columns as panel items |

The `orient` parameter is especially useful for mixed-type DataFrames. If you pass a dict of DataFrame objects with mixed-type columns, all of the data will be upcast to dtype=object unless you pass orient='minor':

In [44]:
oriented_panel = pd.Panel.from_dict(dictionary_of_data_frames, orient='minor')
oriented_panel

Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
Alternatively, you can use the xarray package http://xarray.pydata.org/en/stable/.
Pandas provides a `.to_xarray()` method to help automate this conversion.

<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 4 (major_axis) x 2 (minor_axis)
Items axis: 0 to 2
Major_axis axis: 0 to 3
Minor_axis axis: Item1 to Item2
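
The deprecation warning above recommends a MultiIndex on a DataFrame, and Panel was removed from later pandas releases. A minimal sketch of that alternative, assuming `pd.concat` on a dict of frames is an acceptable stand-in for the Panel constructor (the frame contents and the level names `item` and `row` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Two frames with different column sets, as in the dict-of-DataFrames example
frames = {
    'Item1': pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C']),
    'Item2': pd.DataFrame(np.arange(8).reshape(4, 2), columns=['A', 'B']),
}

# The dict keys become the outer level of a two-level row index
stacked = pd.concat(frames, names=['item', 'row'])
print(stacked)

# Selecting on the outer level recovers the rows of one "item"
item1 = stacked.loc['Item1']
print(item1)
```

Column `C` is NaN for every `Item2` row because that frame has no such column, which mirrors how Panel padded unaligned data.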