
In [0]:
import numpy as np
import pandas as pd

# Pandas DataFrames Part Two


## First Create the DataFrame that we'll be working with

In [0]:
np.random.seed(42)
df = pd.DataFrame(np.random.randint(-100, 100, (5,4)), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])

In [0]:
df

Unnamed: 0,W,X,Y,Z
A,2,79,-8,-86
B,6,-29,88,-80
C,2,21,-26,-13
D,16,-1,3,51
E,30,49,-48,-99


## Filtering DataFrames

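When a DataFrame is one operand of a comparison, the result is a new DataFrame of booleans with the same shape and labels. Each value is the result of the comparison at that location. Operand order does not matter, so 0 < df and df > 0 produce the same result. Indexing the original DataFrame with that boolean DataFrame keeps the values where the comparison is true and replaces the rest with NaN.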

In [0]:
0 < df

Unnamed: 0,W,X,Y,Z
A,True,True,False,False
B,True,False,True,False
C,True,True,False,False
D,True,False,True,True
E,True,True,False,False


In [0]:
df > 0

Unnamed: 0,W,X,Y,Z
A,True,True,False,False
B,True,False,True,False
C,True,True,False,False
D,True,False,True,True
E,True,True,False,False


In [0]:
df[df > 0]

Unnamed: 0,W,X,Y,Z
A,2,79.0,,
B,6,,88.0,
C,2,21.0,,
D,16,,3.0,51.0
E,30,49.0,,
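As a minimal sketch of what the boolean-DataFrame indexing above does, the snippet below rebuilds the same values by hand (typed out from the displayed output rather than regenerated with the random seed) and compares `df[df > 0]` with `DataFrame.where`. The names `mask`, `masked`, and `same` are only illustrative.

```python
import pandas as pd

# Same values as the DataFrame displayed above, typed out explicitly
df = pd.DataFrame(
    [[2, 79, -8, -86],
     [6, -29, 88, -80],
     [2, 21, -26, -13],
     [16, -1, 3, 51],
     [30, 49, -48, -99]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)

mask = df > 0            # boolean DataFrame, same shape as df
masked = df[mask]        # values where mask is False become NaN
same = df.where(mask)    # explicit spelling of the same operation

print(masked.equals(same))
print(masked.isna().sum())   # number of masked-out values per column
print(masked.dtypes)         # columns that received NaN are stored as floats
```

Columns that pick up a NaN are converted to floating point, which is why 79 is displayed as 79.0 in the output above.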


## Filtering by Features

Selecting a single column (feature) from a DataFrame gives you a Series. Using that Series in a comparison returns a new Series of booleans, one per row, holding the result of the comparison for that row of the original DataFrame.

In [0]:
df['Y'] > 0

A    False
B     True
C    False
D     True
E    False
Name: Y, dtype: bool

If you pass a comparison like the one above as the index into the original DataFrame, you get back a new DataFrame containing only the rows for which the predicate evaluated to True.

In [0]:
df[df['Y'] > 0]

Unnamed: 0,W,X,Y,Z
B,6,-29,88,-80
D,16,-1,3,51


We can chain further indexing onto the filtered result to show only specific features or rows.

In [0]:
df[df['Y'] > 0]['W']

B     6
D    16
Name: W, dtype: int64

In [0]:
df[df['Y'] > 0].iloc[0]

W     6
X   -29
Y    88
Z   -80
Name: B, dtype: int64
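Chained indexing works well for reading values, as above. A common alternative is to select rows and columns in a single step with `.loc`, which also avoids the ambiguity chained indexing can cause when assigning values. The sketch below rebuilds the same values by hand; `mask`, `chained`, `direct`, and `first_row` are illustrative names.

```python
import pandas as pd

# Same values as the DataFrame displayed above, typed out explicitly
df = pd.DataFrame(
    [[2, 79, -8, -86],
     [6, -29, 88, -80],
     [2, 21, -26, -13],
     [16, -1, 3, 51],
     [30, 49, -48, -99]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)

mask = df['Y'] > 0

chained = df[mask]['W']            # two steps: filter rows, then pick a column
direct = df.loc[mask, 'W']         # one step: rows by mask, column by label
first_row = df.loc[mask].iloc[0]   # first row that satisfies the predicate

print(chained.equals(direct))
print(direct)
print(first_row.name)
```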

## Filtering by multiple features

In [0]:
df

Unnamed: 0,W,X,Y,Z
A,2,79,-8,-86
B,6,-29,88,-80
C,2,21,-26,-13
D,16,-1,3,51
E,30,49,-48,-99


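Create the filtering predicates, wrap each one in parentheses, and combine them with an element-wise logical operator: & for and, | for or. The parentheses are required because & and | bind more tightly than comparison operators such as >.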

In [0]:
(df['W'] > 0) & (df['Y'] > 0)

A    False
B     True
C    False
D     True
E    False
dtype: bool

Once you have the predicate set up, pass it to the original DataFrame as an index to get back only the rows that satisfy it.

In [0]:
df[(df['W'] > 0) & (df['Y'] > 0)]

Unnamed: 0,W,X,Y,Z
B,6,-29,88,-80
D,16,-1,3,51
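The same pattern extends to the other element-wise operators. The sketch below, which rebuilds the same values by hand, shows `|` (or) and `~` (not), and also shows that Python's plain `and` keyword raises a `ValueError` when applied to two boolean Series. The threshold of 10 and the variable names are chosen only for illustration.

```python
import pandas as pd

# Same values as the DataFrame displayed above, typed out explicitly
df = pd.DataFrame(
    [[2, 79, -8, -86],
     [6, -29, 88, -80],
     [2, 21, -26, -13],
     [16, -1, 3, 51],
     [30, 49, -48, -99]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)

# Rows where W is greater than 10 OR Y is positive
either = df[(df['W'] > 10) | (df['Y'] > 0)]

# Rows where Y is NOT positive
not_positive_y = df[~(df['Y'] > 0)]

# The keyword `and` tries to reduce each Series to a single bool and fails
and_failed = False
try:
    df[(df['W'] > 0) and (df['Y'] > 0)]
except ValueError as err:
    and_failed = True
    print('ValueError:', err)

print(either.index.tolist())
print(not_positive_y.index.tolist())
```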


## Modifying the index

In some cases you may want to turn the index labels into an ordinary column. When you do this, the old labels move into a column named index and a new integer index is created. Note that reset_index returns a new DataFrame and leaves the original unchanged.

In [0]:
df.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,2,79,-8,-86
1,B,6,-29,88,-80
2,C,2,21,-26,-13
3,D,16,-1,3,51
4,E,30,49,-48,-99
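A short sketch of the behavior described above, using the same values rebuilt by hand. It checks that the original DataFrame keeps its labels, and it also shows the `drop=True` option, which discards the old labels instead of turning them into a column. The names `reset` and `dropped` are illustrative.

```python
import pandas as pd

# Same values as the DataFrame displayed above, typed out explicitly
df = pd.DataFrame(
    [[2, 79, -8, -86],
     [6, -29, 88, -80],
     [2, 21, -26, -13],
     [16, -1, 3, 51],
     [30, 49, -48, -99]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)

reset = df.reset_index()             # old labels become the 'index' column
dropped = df.reset_index(drop=True)  # old labels are discarded

print(df.index.tolist())        # original is unchanged
print(reset.columns.tolist())
print(dropped.columns.tolist())
print(dropped.index.tolist())
```

To keep the result, assign it back to a name, for example `df = df.reset_index()`.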


Sometimes we want to use a different set of labels as the index. We start by creating a list that contains the new labels, one per row.

In [0]:
states = ['CA', 'NY', 'WY', 'OR', 'CO']

Next we assign the list to a new column of the DataFrame. Even though the 'States' column doesn't exist yet, that's okay: the DataFrame will create it.

In [0]:
df['States'] = states

In [0]:
df

Unnamed: 0,W,X,Y,Z,States
A,2,79,-8,-86,CA
B,6,-29,88,-80,NY
C,2,21,-26,-13,WY
D,16,-1,3,51,OR
E,30,49,-48,-99,CO
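One thing to watch for when assigning a list as a new column: the list must have exactly one entry per row. The sketch below rebuilds the same values by hand and shows that a list that is too short raises a `ValueError`, after which the full list is assigned normally. The flag `length_error_raised` is an illustrative name.

```python
import pandas as pd

# Same values as the DataFrame displayed above, typed out explicitly
df = pd.DataFrame(
    [[2, 79, -8, -86],
     [6, -29, 88, -80],
     [2, 21, -26, -13],
     [16, -1, 3, 51],
     [30, 49, -48, -99]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)

# A list with too few entries cannot be matched to the five rows
length_error_raised = False
try:
    df['States'] = ['CA', 'NY']
except ValueError as err:
    length_error_raised = True
    print('ValueError:', err)

# A list with one entry per row is assigned by position
states = ['CA', 'NY', 'WY', 'OR', 'CO']
df['States'] = states

print(df['States'])
```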


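To make the States column the new index, use the set_index method. Like reset_index, it returns a new DataFrame rather than modifying the original, and the old index labels (A to E) are not kept in the result.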

In [0]:
df.set_index('States')

States,W,X,Y,Z
CA,2,79,-8,-86
NY,6,-29,88,-80
WY,2,21,-26,-13
OR,16,-1,3,51
CO,30,49,-48,-99
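Once the states are the index, rows can be looked up by state label. The sketch below rebuilds the same values by hand, confirms that `set_index` leaves the original DataFrame untouched, and looks up the NY row with `.loc`. The names `by_state` and `ny_row` are illustrative.

```python
import pandas as pd

# Same values as the DataFrame displayed above, typed out explicitly
df = pd.DataFrame(
    [[2, 79, -8, -86],
     [6, -29, 88, -80],
     [2, 21, -26, -13],
     [16, -1, 3, 51],
     [30, 49, -48, -99]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)
df['States'] = ['CA', 'NY', 'WY', 'OR', 'CO']

by_state = df.set_index('States')   # new DataFrame indexed by state

print(df.index.tolist())            # original still uses A to E
print(by_state.index.tolist())

ny_row = by_state.loc['NY']         # label-based lookup on the new index
print(ny_row)
```

To replace the original, assign the result back, for example `df = df.set_index('States')`.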


## Getting DataFrame info

The describe method returns summary statistics for the numeric columns of the DataFrame: count, mean, standard deviation, minimum, quartiles, and maximum. The non-numeric States column is left out of the result.

In [0]:
df.describe()

Unnamed: 0,W,X,Y,Z
count,5.0,5.0,5.0,5.0
mean,11.2,23.8,1.8,-45.4
std,11.96662,42.109381,51.915316,63.366395
min,2.0,-29.0,-48.0,-99.0
25%,2.0,-1.0,-26.0,-86.0
50%,6.0,21.0,-8.0,-80.0
75%,16.0,49.0,3.0,-13.0
max,30.0,79.0,88.0,51.0
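Because the result of `describe` is itself a DataFrame, individual statistics can be pulled out with the usual indexing. The sketch below rebuilds the same values by hand, reads the row of means, and also calls `describe` on the text column alone, which reports a different set of statistics for non-numeric data. The names `summary` and `states_summary` are illustrative.

```python
import pandas as pd

# Same values as the DataFrame displayed above, typed out explicitly
df = pd.DataFrame(
    [[2, 79, -8, -86],
     [6, -29, 88, -80],
     [2, 21, -26, -13],
     [16, -1, 3, 51],
     [30, 49, -48, -99]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)
df['States'] = ['CA', 'NY', 'WY', 'OR', 'CO']

summary = df.describe()               # numeric columns only
print(summary.columns.tolist())
print(summary.loc['mean'])            # one statistic across all columns
print(summary.loc['max', 'Y'])        # one statistic for one column

states_summary = df['States'].describe()   # count, unique, top, freq
print(states_summary)
```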


The info method prints structural information about the DataFrame: the index range, each column's name, non-null count and dtype, and an estimate of memory usage.

In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, A to E
Data columns (total 5 columns):
W         5 non-null int64
X         5 non-null int64
Y         5 non-null int64
Z         5 non-null int64
States    5 non-null object
dtypes: int64(4), object(1)
memory usage: 400.0+ bytes
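Note that `info` prints its report and returns `None`, so there is nothing to assign. If the text is needed in a variable, one option is to pass a buffer through the `buf` parameter; the individual facts are also available as attributes such as `shape` and `dtypes`. The sketch below rebuilds the same values by hand; `buffer`, `result`, and `report` are illustrative names, and the exact dtypes and memory figures shown may differ between platforms and pandas versions.

```python
import io

import pandas as pd

# Same values as the DataFrame displayed above, typed out explicitly
df = pd.DataFrame(
    [[2, 79, -8, -86],
     [6, -29, 88, -80],
     [2, 21, -26, -13],
     [16, -1, 3, 51],
     [30, 49, -48, -99]],
    index=['A', 'B', 'C', 'D', 'E'],
    columns=['W', 'X', 'Y', 'Z'],
)
df['States'] = ['CA', 'NY', 'WY', 'OR', 'CO']

buffer = io.StringIO()
result = df.info(buf=buffer)   # writes the report into the buffer
report = buffer.getvalue()     # the report as a plain string

print(result)                  # info() itself returns None
print(report)
print(df.shape)                # (rows, columns)
print(df.dtypes)               # dtype of each column
```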
