[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/prof-tcsmith/mis307.git/HEAD?labpath=notebooks%2Fpandas_03.ipynb)

<a href="https://colab.research.google.com/github/prof-tcsmith/mis307/blob/master/notebooks/pandas_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# DataFrames Introduction
Pandas Series objects are a powerful way to organize and analyze one-dimensional data, but we often need to analyze two-dimensional data as well.

A pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input, including a dict of lists or Series, a 2-D NumPy array, a single Series, or another DataFrame.

<sub>(source: http://pandas.pydata.org/pandas-docs/stable/dsintro.html)</sub>

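You can think of a DataFrame as a collection of Series that share an index. Here we build two Series and combine them; note that passing a list of Series to `pd.DataFrame` makes each Series a row, with the shared index labels becoming the column names: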

In [1]:
import pandas as pd
import matplotlib

index_names = ("First", "Second", "Third")
a = (1,2,3)
b = [4.4,5.567,6.987645]
c = {"Seventh":7,'Eighth':8,"Ninth":9}



In [2]:
s1 = pd.Series(a, index=index_names)
s1

First     1
Second    2
Third     3
dtype: int64

In [3]:
s2 = pd.Series(b, index=index_names) 
s2

First     4.400000
Second    5.567000
Third     6.987645
dtype: float64

In [4]:
df = pd.DataFrame([s1, s2])
df

   First  Second     Third
0    1.0     2.0       3.0
1    4.4   5.567  6.987645
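
In the frame above, each Series became a row. If the goal is one column per Series, one option is to pass a dict of Series instead. The following is a minimal sketch; the column names `s1` and `s2` are chosen here purely for illustration.

```python
import pandas as pd

index_names = ("First", "Second", "Third")
s1 = pd.Series((1, 2, 3), index=index_names)
s2 = pd.Series([4.4, 5.567, 6.987645], index=index_names)

# A list of Series: each Series becomes one row
by_rows = pd.DataFrame([s1, s2])

# A dict of Series: each Series becomes one column, named by its dict key
# ("s1" and "s2" are illustrative names, not part of the original example)
by_cols = pd.DataFrame({"s1": s1, "s2": s2})

print(by_rows.shape)  # (2, 3)
print(by_cols.shape)  # (3, 2)
print(by_cols)
```

With the dict form, each column keeps its own dtype, whereas the row-wise form has to fit an integer and a float into the same column.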


In [5]:
df.describe()

          First   Second     Third
count       2.0      2.0       2.0
mean        2.7   3.7835  4.993823
std    2.404163  2.52225  2.819691
min         1.0      2.0       3.0
25%        1.85  2.89175  3.996911
50%         2.7   3.7835  4.993823
75%        3.55  4.67525  5.990734
max         4.4    5.567  6.987645
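
The summary returned by `describe()` is itself a DataFrame, so individual statistics can be looked up by row label. As a small sketch (the name `df_demo` is introduced here only so the example is self-contained), the `mean` row matches what `mean()` returns directly:

```python
import pandas as pd

index_names = ("First", "Second", "Third")
df_demo = pd.DataFrame(
    [
        pd.Series((1, 2, 3), index=index_names),
        pd.Series([4.4, 5.567, 6.987645], index=index_names),
    ]
)

stats = df_demo.describe()

# Each row of the summary is an ordinary per-column aggregation
print(stats.loc["mean"])   # same values as df_demo.mean()
print(df_demo.mean())
print(stats.loc["count"])  # 2.0 for every column: the frame has only two rows
```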


There are many ways to create a DataFrame, for instance from a dictionary of lists:

In [6]:
d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}
d

{'one': [1.0, 2.0, 3.0, 4.0], 'two': [4.0, 3.0, 2.0, 1.0]}

In [7]:
df = pd.DataFrame(d)
df

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0
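
A dictionary of lists is column-oriented. As a sketch of two row-oriented alternatives (the variable names below are illustrative), the same kind of frame can be built from a list of dicts or from a list of lists with explicit column names:

```python
import pandas as pd

# From a list of dicts: each dict is one row, and the keys become column names
rows = [{"one": 1.0, "two": 4.0}, {"one": 2.0, "two": 3.0}]
df_from_dicts = pd.DataFrame(rows)

# From a list of lists: each inner list is one row, columns are named explicitly
df_from_lists = pd.DataFrame([[1.0, 4.0], [2.0, 3.0]], columns=["one", "two"])

print(df_from_dicts)
print(df_from_lists)
```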


Each column of a DataFrame has a name, while each row has an index label. When no index is supplied, pandas automatically generates a numbered index (a `RangeIndex`):

In [8]:
df.index

RangeIndex(start=0, stop=4, step=1)
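
The automatic numbering is only a default. As a minimal sketch, row labels can be supplied through the `index` argument of the constructor; the labels `a` through `d` are hypothetical and chosen only for illustration.

```python
import pandas as pd

d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}

# Supply row labels explicitly instead of relying on the default RangeIndex
df_labeled = pd.DataFrame(d, index=["a", "b", "c", "d"])

print(df_labeled.index)             # the row labels
print(df_labeled.columns)           # the column names
print(df_labeled.loc["b", "two"])   # label-based lookup: 3.0
```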

We can easily add columns to a DataFrame:

In [9]:
df['three']=df['one']*df['two']
df

   one  two  three
0  1.0  4.0    4.0
1  2.0  3.0    6.0
2  3.0  2.0    6.0
3  4.0  1.0    4.0


In [10]:
df['four']=[10,11,12,13]
df

   one  two  three  four
0  1.0  4.0    4.0    10
1  2.0  3.0    6.0    11
2  3.0  2.0    6.0    12
3  4.0  1.0    4.0    13
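
Bracket assignment modifies the DataFrame in place and appends the new column at the end. Two other options, sketched here on a separate frame named `df_demo` for illustration, are `assign`, which returns a new DataFrame, and `insert`, which places the column at a chosen position:

```python
import pandas as pd

df_demo = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]})

# assign returns a new DataFrame; df_demo itself is left unchanged
df_new = df_demo.assign(three=df_demo["one"] * df_demo["two"])

# insert modifies df_demo in place, putting the new column at position 0
df_demo.insert(0, "zero", [0, 0, 0, 0])

print(list(df_new.columns))   # ['one', 'two', 'three']
print(list(df_demo.columns))  # ['zero', 'one', 'two']
```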


Or, if we wish, delete any column of a DataFrame:

In [11]:
del df['one']
df

   two  three  four
0  4.0    4.0    10
1  3.0    6.0    11
2  2.0    6.0    12
3  1.0    4.0    13


NOTE: `del` simply deletes the column, while `pop` deletes the column and returns it (as a Series).

In [12]:
ret_s = df.pop('three')
ret_s

0    4.0
1    6.0
2    6.0
3    4.0
Name: three, dtype: float64

In [13]:
df

   two  four
0  4.0    10
1  3.0    11
2  2.0    12
3  1.0    13
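
Both `del` and `pop` change the DataFrame in place and work on columns only. As a sketch of a non-destructive alternative, `drop` returns a new DataFrame and can remove either columns or rows; the frame `df_demo` below is a stand-in built just for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({"two": [4.0, 3.0, 2.0, 1.0], "four": [10, 11, 12, 13]})

# Drop a column by name; the result is a new DataFrame
no_four = df_demo.drop(columns=["four"])

# Drop a row by its index label (here the label 0)
no_first_row = df_demo.drop(index=[0])

print(list(no_four.columns))     # ['two']
print(list(no_first_row.index))  # [1, 2, 3]
print(df_demo.shape)             # (4, 2): the original is unchanged
```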


And, as we might expect, there are a number of operations we can use to manipulate columns:

In [14]:
df2 = pd.DataFrame()  # create an empty DataFrame
df2['two'] = df['two']  # add a column 'two', copying its values from df
df2['two_mean'] = df2['two'].mean()  # scalar mean, repeated on every row
df2['two_stdev'] = df2['two'].std()  # sample standard deviation, repeated on every row
df2['two_standardized'] = (df2['two'] - df2['two_mean']) / df2['two_stdev']  # z-score of each value
df2

   two  two_mean  two_stdev  two_standardized
0  4.0       2.5   1.290994          1.161895
1  3.0       2.5   1.290994          0.387298
2  2.0       2.5   1.290994         -0.387298
3  1.0       2.5   1.290994         -1.161895
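
The helper columns are convenient for inspection but not required. As a sketch, the same standardized values can be computed directly on a Series. Note that `std()` defaults to the sample standard deviation (`ddof=1`); passing `ddof=0` gives the population version, which yields different z-scores.

```python
import pandas as pd

two = pd.Series([4.0, 3.0, 2.0, 1.0], name="two")

# Default: sample standard deviation (ddof=1), matching the table above
z_sample = (two - two.mean()) / two.std()

# Population standard deviation (ddof=0) for comparison
z_population = (two - two.mean()) / two.std(ddof=0)

print(z_sample.round(6).tolist())      # [1.161895, 0.387298, -0.387298, -1.161895]
print(z_population.round(6).tolist())
```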


And we can iterate through all columns of a DataFrame as follows:

In [16]:
df3 = pd.DataFrame()
for column in df:
    df3[column]=df[column]
    df3[column+"_mean"]=df[column].mean()
    df3[column+"_stdev"]=df[column].std()
    df3[column+'_standardized']= (df3[column]-df3[column+'_mean'])/df3[column+'_stdev']
df3

   two  two_mean  two_stdev  two_standardized  four  four_mean  four_stdev  four_standardized
0  4.0       2.5   1.290994          1.161895    10       11.5    1.290994          -1.161895
1  3.0       2.5   1.290994          0.387298    11       11.5    1.290994          -0.387298
2  2.0       2.5   1.290994         -0.387298    12       11.5    1.290994           0.387298
3  1.0       2.5   1.290994         -1.161895    13       11.5    1.290994           1.161895
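
When only the standardized values are needed, the loop can be avoided, because arithmetic between a DataFrame and the Series returned by `mean()` or `std()` aligns on column names. The following is a minimal sketch on a stand-in frame named `df_demo`; it also shows `items()`, which yields each column name together with its Series.

```python
import pandas as pd

df_demo = pd.DataFrame({"two": [4.0, 3.0, 2.0, 1.0], "four": [10, 11, 12, 13]})

# mean() and std() return one value per column; the arithmetic aligns by column name
standardized = (df_demo - df_demo.mean()) / df_demo.std()
standardized = standardized.add_suffix("_standardized")
print(standardized)

# items() yields (column name, column Series) pairs
for name, col in df_demo.items():
    print(name, col.mean(), col.std())
```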
