___

<a href='https://www.prosperousheart.com/'> <img src='files/learn to code online.png' /></a>
___

DataFrames are the true workhorse of pandas. You'll learn more here.

The DataFrame constructor accepts the following parameters:
- data
- index
- columns
- dtype
- copy

Learn more about these options <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">here</a>.
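As a minimal sketch of how these five parameters fit together, the cell below passes each one by keyword. The sample values, the labels, and the name `small_df` are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only for illustration
values = np.array([[1, 2], [3, 4], [5, 6]])

small_df = pd.DataFrame(
    data=values,                # the values to store
    index=["a", "b", "c"],      # row labels
    columns=["left", "right"],  # column labels
    dtype="float64",            # force one dtype for every column
    copy=True,                  # copy the input instead of sharing memory with it
)

print(small_df)
print(small_df.dtypes)
```

Passing the arguments by keyword is optional, but it makes it easier to see which list holds the row labels and which holds the column labels.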

In [10]:
import numpy as np
import pandas as pd

In [11]:
# https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.randn.html
from numpy.random import randn

In [12]:
# https://stackoverflow.com/questions/36847022/what-numbers-that-i-can-put-in-numpy-random-seed
# setting the seed ensures we get the same random numbers
np.random.seed(101)

In [13]:
df = pd.DataFrame(randn(5,4), ["A", "B", "C", "D", "E"], ["W", "X", "Y", "Z"]) # there will be 5 rows & 4 columns as per randn
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


# Accessing Data From A DataFrame

Each column is a pandas Series, and all of the columns share the same index. You use bracket notation to pull a column out, just as you would with a dictionary key.

In [17]:
# get column (series) W
print(type(df['W']))
df['W']

<class 'pandas.core.series.Series'>


A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [18]:
# There is another way to get the column, similar to SQL's dot notation. The prior way is the norm.
# This is not suggested, as a column name can be confused with the various methods of a DataFrame
df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
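To illustrate why attribute access can be confusing, here is a small sketch that uses a hypothetical frame with a column named `count`, which is also the name of a DataFrame method. The frame and its values are assumptions made for this example.

```python
import pandas as pd

# Hypothetical frame: one column shares its name with the DataFrame.count method
demo = pd.DataFrame({"count": [5, 7], "W": [1.0, 2.0]})

# Attribute access finds the method, not the column
print(callable(demo.count))    # True

# Bracket notation always returns the column
print(demo["count"].tolist())  # [5, 7]

# Attribute access does work when the name does not collide with anything
print(demo.W.tolist())         # [1.0, 2.0]
```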

In [20]:
# To get multiple columns back, you need to pass in a list of column names
df[["W", "Z"]]

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


In [21]:
# You can use this same list notation for a single column - note that it returns a one-column DataFrame, not a Series
df[["W",]]

Unnamed: 0,W
A,2.70685
B,0.651118
C,-2.018168
D,0.188695
E,0.190794
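The difference between the two notations is easiest to see by checking the type and shape of each result. This is a small sketch on a hypothetical frame named `demo`, not the `df` used in the rest of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 frame used only for this comparison
demo = pd.DataFrame(
    np.arange(6).reshape(3, 2), index=["A", "B", "C"], columns=["W", "Z"]
)

as_series = demo["W"]   # a single name returns a Series
as_frame = demo[["W"]]  # a list of names returns a DataFrame, even with one name

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```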


# Add New Columns To A DF

To create a new column, assign to it with bracket notation, or include it in the data when you create the DF. If you try to read a column that doesn't exist, you will get an error.

<div class="alert alert-block alert-warning">The following 3 examples give you <b>NaN</b> - why do you think that is?</div>

In [35]:
df["new"] = df["W"] + df[["Z",]]
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,
B,0.651118,-0.319318,-0.848077,0.605965,
C,-2.018168,0.740122,0.528813,-0.589001,
D,0.188695,-0.758872,-0.933237,0.955057,
E,0.190794,1.978757,2.605967,0.683509,


In [36]:
df["new"] = df[["W",]] + df["Z"]
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,
B,0.651118,-0.319318,-0.848077,0.605965,
C,-2.018168,0.740122,0.528813,-0.589001,
D,0.188695,-0.758872,-0.933237,0.955057,
E,0.190794,1.978757,2.605967,0.683509,


In [43]:
df["new"] = df[["W",]] + df[["Z",]]
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,
B,0.651118,-0.319318,-0.848077,0.605965,
C,-2.018168,0.740122,0.528813,-0.589001,
D,0.188695,-0.758872,-0.933237,0.955057,
E,0.190794,1.978757,2.605967,0.683509,


<div class="alert alert-block alert-warning">Use the block below to see if you can figure it out.</div>

When adding two columns together, use single-bracket (Series) notation on both sides.

In [42]:
df["new"] = df["W"] + df["Z"]
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.210676
B,0.651118,-0.319318,-0.848077,0.605965,1.257083
C,-2.018168,0.740122,0.528813,-0.589001,-2.607169
D,0.188695,-0.758872,-0.933237,0.955057,1.143752
E,0.190794,1.978757,2.605967,0.683509,0.874303
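The NaN results come from label alignment: pandas matches labels before it adds values, and a DataFrame brings column labels into the match. The sketch below inspects the sums themselves on a hypothetical frame named `demo`. It does not assign them back to a column, since that step may behave differently depending on your pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 frame used only to inspect the intermediate sums
demo = pd.DataFrame(
    np.arange(1.0, 7.0).reshape(3, 2), index=["A", "B", "C"], columns=["W", "Z"]
)

# Series + DataFrame: the Series index (A, B, C) is matched against the
# DataFrame columns (Z). Nothing matches, so every cell is NaN.
series_plus_frame = demo["W"] + demo[["Z"]]
print(series_plus_frame)

# DataFrame + DataFrame: the column labels W and Z do not match, so again all NaN.
frame_plus_frame = demo[["W"]] + demo[["Z"]]
print(frame_plus_frame)

# Series + Series: both share the index A, B, C, so values are added row by row.
series_plus_series = demo["W"] + demo["Z"]
print(series_plus_series.tolist())  # [3.0, 7.0, 11.0]
```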


# Dropping Data From A DataFrame

By default, the DataFrame <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html">drop method</a> looks for the given labels in the index (the rows). If you try to drop a column (a whole Series) without changing **axis** to 1, you will receive an error.

In [37]:
df.drop('new') # this is expecting a label or row name

KeyError: "['new'] not found in axis"

In [45]:
print(df)
df.drop('new', axis=1) # this is expecting a column

          W         X         Y         Z  new
A  2.706850  0.628133  0.907969  0.503826  NaN
B  0.651118 -0.319318 -0.848077  0.605965  NaN
C -2.018168  0.740122  0.528813 -0.589001  NaN
D  0.188695 -0.758872 -0.933237  0.955057  NaN
E  0.190794  1.978757  2.605967  0.683509  NaN


Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


You'll notice in the cell below that **df** still has the "new" column. This is because **drop** does not operate in place by default. It returns a new DataFrame and leaves the original unchanged unless you assign the result back to the variable **OR** set the _inplace_ parameter to True.

In [47]:
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,
B,0.651118,-0.319318,-0.848077,0.605965,
C,-2.018168,0.740122,0.528813,-0.589001,
D,0.188695,-0.758872,-0.933237,0.955057,
E,0.190794,1.978757,2.605967,0.683509,


In [48]:
df.drop('new', axis=1, inplace=True) # this is expecting a column & done in place
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


Pandas defaults _inplace_ to False so that you don't lose information by accident.
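As an alternative to `inplace=True`, you can assign the result back to the variable. The sketch below shows that option on a hypothetical frame named `demo`, along with the `columns=` keyword, which drops a column without needing `axis=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a column we want to remove
demo = pd.DataFrame(
    np.arange(6).reshape(3, 2), index=["A", "B", "C"], columns=["W", "new"]
)

# Both calls return a new DataFrame without the column
dropped_by_axis = demo.drop("new", axis=1)
dropped_by_keyword = demo.drop(columns="new")

print("new" in demo.columns)  # True - the original is untouched

# Reassigning makes the change stick without using inplace
demo = demo.drop("new", axis=1)
print(list(demo.columns))     # ['W']
```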

What should you do to drop row C?

In [49]:
df.drop('C')  # same as df.drop('C', axis=0)

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


<div class="alert alert-block alert-warning">Why does the axis use 0 for rows and 1 for columns?</div>

DataFrames are essentially fancy index markers on top of a NumPy array.

Rows run along axis 0 because the number of rows sits in position 0 of the shape tuple.
Columns run along axis 1 because the number of columns sits in position 1 of the tuple.

In [52]:
df.shape  # shows tuple (# of rows, # of columns)

(5, 4)
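The same axis numbers appear in other DataFrame methods. As a small sketch on a hypothetical frame named `demo`, compare the shape of the underlying NumPy array with what `sum` does along each axis.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 frame holding the values 0 through 5
demo = pd.DataFrame(
    np.arange(6).reshape(3, 2), index=["A", "B", "C"], columns=["W", "Z"]
)

arr = demo.to_numpy()
print(arr.shape)  # (3, 2) - rows in position 0, columns in position 1

# axis=0 works down the rows, leaving one total per column
print(demo.sum(axis=0).tolist())  # [6, 9]

# axis=1 works across the columns, leaving one total per row
print(demo.sum(axis=1).tolist())  # [1, 5, 9]
```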

# Selecting Rows From A DataFrame

There are 2 ways to get rows from a DF.

## loc

`df.loc[idx_label]`

In [53]:
df.loc['B']  # returns a series

W    0.651118
X   -0.319318
Y   -0.848077
Z    0.605965
Name: B, dtype: float64

## iloc

This is based on numerical position in the DataFrame, regardless of labels. The top row is always position 0.

`df.iloc[num]`

In [54]:
df.iloc[1]  # same as saying df.loc['B'] for this example

W    0.651118
X   -0.319318
Y   -0.848077
Z    0.605965
Name: B, dtype: float64
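The difference between the two matters most when the index labels are themselves integers. A minimal sketch, using a hypothetical frame whose integer labels are deliberately out of order:

```python
import pandas as pd

# Hypothetical frame: the index labels are integers, but not in positional order
demo = pd.DataFrame({"W": [10, 20, 30]}, index=[2, 0, 1])

by_label = demo.loc[0]      # the row whose label is 0 (the second row)
by_position = demo.iloc[0]  # the row in position 0 (the first row)

print(by_label["W"])     # 20
print(by_position["W"])  # 10
```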

# Selecting Subsets Of Rows & Columns

## Single Item

`df.loc[row, col]`

In [60]:
print(df)
df.loc['B', 'X']

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


-0.31931804459303326

## Multiple Item

`df.loc[[list_of_rows], [list_of_cols]]`

This takes only the listed rows and, for those rows, returns only the listed columns. The result is a subset of the data (a smaller DataFrame), not just a single value.

In [61]:
print(df)
df.loc[['B', 'D'], ['X', 'Z']]

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


Unnamed: 0,X,Z
B,-0.319318,0.605965
D,-0.758872,0.955057
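The same kind of subset can be taken by position with `iloc`. The sketch below uses a hypothetical frame with the same labels as above but made-up integer values, and checks that both selections return the same data.

```python
import numpy as np
import pandas as pd

# Hypothetical 5x4 frame with the same labels as df, holding the values 0 through 19
demo = pd.DataFrame(
    np.arange(20).reshape(5, 4), index=list("ABCDE"), columns=list("WXYZ")
)

by_label = demo.loc[["B", "D"], ["X", "Z"]]  # rows B and D, columns X and Z
by_position = demo.iloc[[1, 3], [1, 3]]      # the same rows and columns by position

print(by_label)
print(by_label.equals(by_position))  # True
```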

