<div align="right">Python [conda env:PY27_rtclone]</div>
<div align="right">Python [conda env:PY36_clone]</div>

# Working with Pandas DataFrames

This notebook was created in Python 2.7 and cross-tested in Python 3.6, so the code should work in both versions.  It explores the basic syntax for working with Pandas DataFrames.  Some useful functions and one-line snippets to know:
- `DataFrame(data={key1:[data, data2], key2:[d, d2]}, columns=["col1", "col2"], index=indexVar, dtype=int64)`
- `data_df.astype(dtype= {"wheel_number":"int64", "car_name":"object","minutes_spent":"float64"})`
- `pd.DataFrame(df_data)`
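
As a minimal sketch of the constructor and `astype()` snippets listed above (the sample values and the index labels `"a"` and `"b"` are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data; the dictionary keys become the column names
data = {"wheel_number": [4, 2],
        "car_name": ["sedan", "bike"],
        "minutes_spent": [30, 12]}

# columns= fixes the column order; index= supplies the row labels
data_df = pd.DataFrame(data,
                       columns=["car_name", "wheel_number", "minutes_spent"],
                       index=["a", "b"])

# astype() with a dictionary converts only the columns that are named
data_df = data_df.astype({"wheel_number": "int64", "minutes_spent": "float64"})
print(data_df.dtypes)
```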

## TOC
- [Building a Simple DataFrame](#bld1)
  - [from a Dictionary](#dixndf)
  - [Create Empty and Add Rows One by One](#lbldf)
  - [Setting dtypes on Each Column of DF](#lbldf)

In [1]:
# libraries used throughout this notebook
import pandas as pd
import numpy as np

<a id="bld1" name="bld1"></a>
## Building a Simple DataFrame

<a id="dixndf" name="dixndf"></a>
### Create DF from Dictionary

In [2]:
# create a dictionary
stateData = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
             'year': [2000, 2001, 2002, 2001, 2002],
             'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
# convert to DataFrame
df_stdt = pd.DataFrame(stateData)
df_stdt

| | pop | state | year |
|---|---|---|---|
| 0 | 1.5 | Ohio | 2000 |
| 1 | 1.7 | Ohio | 2001 |
| 2 | 3.6 | Ohio | 2002 |
| 3 | 2.4 | Nevada | 2001 |
| 4 | 2.9 | Nevada | 2002 |
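
In the output above, the columns come back in alphabetical order rather than in the order typed in the dictionary. Older Python and pandas combinations tend to do this with plain dictionaries. A minimal sketch of pinning the order explicitly with `columns=` (the name `df_ordered` is introduced here only for illustration):

```python
import pandas as pd

stateData = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
             'year': [2000, 2001, 2002, 2001, 2002],
             'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# columns= selects the dictionary keys to use and the order they appear in
df_ordered = pd.DataFrame(stateData, columns=['state', 'year', 'pop'])
print(list(df_ordered.columns))
```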


In [3]:
print("Index:  %s" %df_stdt.index)
print("Values:  \n%s" %df_stdt.values)
print("Columns: %s" %df_stdt.columns)
print("dataframe.describe():")
df_stdt.describe()

Index:  RangeIndex(start=0, stop=5, step=1)
Values:  
[[1.5 'Ohio' 2000L]
 [1.7 'Ohio' 2001L]
 [3.6 'Ohio' 2002L]
 [2.4 'Nevada' 2001L]
 [2.9 'Nevada' 2002L]]
Columns: Index([u'pop', u'state', u'year'], dtype='object')
dataframe.describe():


| | pop | year |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 2.42 | 2001.2 |
| std | 0.864292 | 0.83666 |
| min | 1.5 | 2000.0 |
| 25% | 1.7 | 2001.0 |
| 50% | 2.4 | 2001.0 |
| 75% | 2.9 | 2002.0 |
| max | 3.6 | 2002.0 |
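
By default `describe()` reports only the numeric columns, which is why `state` is missing from the summary above. A short sketch, assuming the same `stateData` dictionary, of asking for every column with `include='all'` (the name `summary` is introduced here for illustration):

```python
import pandas as pd

stateData = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
             'year': [2000, 2001, 2002, 2001, 2002],
             'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df_stdt = pd.DataFrame(stateData)

# include='all' adds the non-numeric columns; statistics that do not
# apply to a given column are filled with NaN
summary = df_stdt.describe(include='all')
print(summary)
```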


<a id="lbldf" name="lbldf"></a>
### Create Empty And Add Rows One At A Time
A common scenario: a loop or function call needs to add rows to a DataFrame one at a time, so an empty DataFrame has to exist before the first row is added.  The code below shows one way to handle that case.

In [4]:
solutionPD = pd.DataFrame({ 'disk':[],'fromPeg':[], 'toPeg':[]}, dtype=np.int64 ) 
solutionPD = solutionPD.append(pd.DataFrame({ 'disk':[5],'fromPeg':[1], 'toPeg':[3]}), ignore_index=True)
solutionPD = solutionPD.append(pd.DataFrame({ 'disk':[4],'fromPeg':[1], 'toPeg':[2] }), ignore_index=True)
print("disk column type: %s" %type(solutionPD['disk'][0]))
solutionPD

disk column type: <type 'numpy.int64'>


| | disk | fromPeg | toPeg |
|---|---|---|---|
| 0 | 5 | 1 | 3 |
| 1 | 4 | 1 | 2 |
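
`DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell above only runs on older installations. A minimal sketch of two alternatives that do not depend on it: collecting rows in a list and building the DataFrame once, or using `pd.concat`. The names `rows`, `solution_df`, and `extra` are introduced here for illustration.

```python
import pandas as pd

# Collect each row as a plain dict inside the loop...
rows = []
for disk, from_peg, to_peg in [(5, 1, 3), (4, 1, 2)]:
    rows.append({'disk': disk, 'fromPeg': from_peg, 'toPeg': to_peg})

# ...then build the DataFrame once at the end
solution_df = pd.DataFrame(rows, columns=['disk', 'fromPeg', 'toPeg'])

# pd.concat is the closest replacement for a single append() call
extra = pd.DataFrame({'disk': [3], 'fromPeg': [1], 'toPeg': [3]})
solution_df = pd.concat([solution_df, extra], ignore_index=True)
print(solution_df)
```

Building from a list is usually the cheaper of the two, since each `append()` or `concat()` call copies the whole DataFrame.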


In [5]:
# Recommended: to set the dtype of every column, build the DataFrame first
# and then call astype() with a dictionary, as shown here.

# Online help topics suggest that a dictionary or list of tuples can be passed
# to dtype= to set all the datatypes up front, but under Python 2.7 this does
# not seem to work. The closest attempt was:
#   solutionPD2 = pd.DataFrame({ 'disk':[],'fromPeg':[], 'toPeg':[], 'text':[], 'notes':[]}, 
#                            dtype=[('disk', 'int64'), ('fromPeg', 'int64' ), ('toPeg', 'float64'), 
#                                   ('text', 'str'), ('notes', 'object')] ) 
# Related links:
#    http://stackoverflow.com/questions/21197774/assign-pandas-dataframe-column-dtypes
#    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html

solutionPD2 = pd.DataFrame({ 'disk':[],'fromPeg':[], 'toPeg':[], 'notes':[], 'text':[]}, dtype=np.int64) 
solutionPD2 = solutionPD2.astype({ 'disk':'int64','fromPeg':'float64', 'toPeg':'float64', 'notes':'object', 'text':'str'})

solutionPD2 = solutionPD2.append(pd.DataFrame({ 'disk':[5],'fromPeg':[1], 'toPeg':[3], 
                                                'notes':'hello', 'text':'more text' }), ignore_index=True)
solutionPD2 = solutionPD2.append(pd.DataFrame({ 'disk':[4],'fromPeg':[1], 'toPeg':[2], 
                                                'notes':'good bye', 'text':'even more text'}), ignore_index=True)

# print("disk column type: %s" %type(solutionPD['disk'][0]))
print(solutionPD2.dtypes)
solutionPD2

disk         int64
fromPeg    float64
notes       object
text        object
toPeg      float64
dtype: object


| | disk | fromPeg | notes | text | toPeg |
|---|---|---|---|---|---|
| 0 | 5 | 1.0 | hello | more text | 3.0 |
| 1 | 4 | 1.0 | good bye | even more text | 2.0 |
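
Another way to get per-column dtypes at construction time is to build each empty column as a `pd.Series` with its own dtype. This is a sketch rather than the approach tested in this notebook, and it has not been verified under Python 2.7. The name `typed_empty` is introduced here for illustration.

```python
import pandas as pd

# Each empty Series carries its own dtype into the resulting DataFrame
typed_empty = pd.DataFrame({
    'disk': pd.Series(dtype='int64'),
    'fromPeg': pd.Series(dtype='float64'),
    'toPeg': pd.Series(dtype='float64'),
    'notes': pd.Series(dtype='object'),
})
print(typed_empty.dtypes)
```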
