# DataFrames Pt. 1

> DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!

# In Pt. 1 we cover the following:
* Creating a basic DataFrame
* Indexing
* Selection
* Dropping rows and columns, and the importance of the inplace parameter
* The reasoning behind axis = 0 for rows and axis = 1 for columns
* Accessing rows and columns in a DataFrame

In [1]:
import numpy as np
import pandas as pd

In [2]:
from numpy.random import randn

In [22]:
# The next cell adds gridlines to the rendered DataFrame tables.

In [23]:
%%HTML
<style type="text/css">
table.dataframe td, table.dataframe th {
    border: 1px black solid !important;
    color: black !important;
}
</style>

In [3]:
# Setting a seed makes sure that we get the same random numbers on every run.
np.random.seed(101)
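
As a quick check of what the seed does, the sketch below seeds NumPy twice with the same value and compares the draws. The names `first` and `second` are illustrative and not part of the original notebook.

```python
import numpy as np

# Seeding with the same value replays the same sequence of random draws.
np.random.seed(101)
first = np.random.randn(3)

np.random.seed(101)
second = np.random.randn(3)

print(np.array_equal(first, second))  # True
```

Without the second call to `np.random.seed`, the second draw would continue the sequence and produce different numbers.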

In [4]:
df = pd.DataFrame(randn(5, 4), index=['A', 'B', 'C', 'D', 'E'], columns=['W', 'X', 'Y', 'Z'])

In [15]:
df # Shows the columns W, X, Y, Z and the rows A, B, C, D, E.
# Each column is a pandas Series; W, X, Y and Z are Series sharing a common index.
# That's what a DataFrame is: a bunch of Series that share an index.

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
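
To make the "bunch of Series sharing an index" idea concrete, here is a minimal sketch that builds a small DataFrame from two Series. The names `w`, `x` and `frame` and the rounded values are assumptions made for illustration; they are not from the notebook.

```python
import pandas as pd

index = ['A', 'B', 'C']

# Two Series built on the same index (illustrative, rounded values)
w = pd.Series([2.7, 0.65, -2.0], index=index, name='W')
x = pd.Series([0.63, -0.32, 0.74], index=index, name='X')

# Each dict key becomes a column; the shared index becomes the row labels
frame = pd.DataFrame({'W': w, 'X': x})

print(frame)
print(frame['W'].equals(w))         # True
print(frame.index.equals(w.index))  # True
```

Pulling a column back out with `frame['W']` returns a Series equal to the one that went in, which is the relationship the comments above describe.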


# Indexing and Selection

In [7]:
df['W'] # Grabs the W column, which is returned as a Series. Prefer this bracket notation for selecting a column.

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [8]:
type(df['W']) # Shows that it is a series

pandas.core.series.Series

In [9]:
type(df)

pandas.core.frame.DataFrame

In [11]:
df.W # If you are familiar with SQL, you often select a column as table.col_name, and this works here too!
# Not recommended: attribute access can clash with DataFrame methods and attributes.

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
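
One reason to avoid attribute access can be sketched with a column whose name collides with an existing DataFrame method. The frame `df_demo` and its `count` column are hypothetical and chosen only to trigger the clash.

```python
import pandas as pd

# Hypothetical frame with a column named like a DataFrame method
df_demo = pd.DataFrame({'W': [1, 2], 'count': [3, 4]})

# Bracket notation always returns the column
print(df_demo['count'].tolist())  # [3, 4]

# Attribute access finds the DataFrame.count method instead of the column
print(callable(df_demo.count))    # True
```

Bracket notation also works for column names that are not valid Python identifiers, such as names containing spaces.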

In [14]:
# You can also pass in a list of columns
df[['W','Z']] # Asking for multiple columns returns a DataFrame, while a single column is returned as a Series.

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509
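
The difference between single and double brackets is easy to miss, so here is a small sketch. The frame `df_demo` is an illustrative stand-in that reuses the first two rows of W and Z from the output above.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [2.706850, 0.651118], 'Z': [0.503826, 0.605965]},
    index=['A', 'B'],
)

as_series = df_demo['W']    # single label -> Series
as_frame = df_demo[['W']]   # list with one label -> one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (2, 1)
```

Passing a list, even a list with one element, keeps the result two-dimensional.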


In [16]:
df['new'] = df['W']+df['Y']
# When creating a new column, we can assign to it as if it already exists and, on the RHS of the = sign,
# use arithmetic on other columns to compute it.

In [24]:
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762
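
As a side note, the same column arithmetic can be written with `DataFrame.assign`, which returns a new DataFrame instead of modifying the existing one. The sketch below uses a hypothetical frame `df_demo` with made-up values, and the `diff` column is an illustration only.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1.0, 2.0], 'Y': [10.0, 20.0]}, index=['A', 'B'])

# Assignment style from the cell above: adds the column to df_demo itself
df_demo['new'] = df_demo['W'] + df_demo['Y']

# assign() returns a new DataFrame and leaves df_demo untouched
with_diff = df_demo.assign(diff=df_demo['Y'] - df_demo['W'])

print(list(df_demo.columns))    # ['W', 'Y', 'new']
print(list(with_diff.columns))  # ['W', 'Y', 'new', 'diff']
print(df_demo['new'].tolist())  # [11.0, 22.0]
```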


**To remove a column we use df.drop() and pass in the column name. We also need to pass axis = 1, because axis defaults to 0 (rows).**

In [26]:
df.drop('new',axis=1)

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [27]:
# df.drop() doesn't actually modify the DataFrame, as we can see below when we display df again.
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [28]:
# To actually remove the column new we have to pass the parameter inplace = True
# Pandas requires this so that we do not accidentally lose valuable information while dropping.
df.drop('new',axis=1,inplace=True)

In [30]:
df # The column new is now permanently removed.

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
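
The behaviour of `drop` with and without `inplace` can be summarized in one short sketch. The frame `df_demo` and its integer values are illustrative; reassigning the result, shown in the middle step, is a common alternative to `inplace=True`.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [5, 6]}, index=['A', 'B'])

# Without inplace, drop() returns a new DataFrame; df_demo keeps the column
dropped = df_demo.drop('new', axis=1)
print('new' in df_demo.columns)  # True
print('new' in dropped.columns)  # False

# Reassigning the result has the same effect as inplace=True
df_demo = df_demo.drop('new', axis=1)
print(list(df_demo.columns))     # ['W', 'X']

# With inplace=True, drop() modifies the object and returns None
result = df_demo.drop('X', axis=1, inplace=True)
print(result)                    # None
print(list(df_demo.columns))     # ['W']
```

Because the in-place call returns `None`, writing `df = df.drop(..., inplace=True)` would replace the DataFrame with `None`.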


In [32]:
# df.drop() can also be used to drop rows.
df.drop('E',axis=0)

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


**Another point of confusion is why rows have axis = 0 and columns have axis = 1.
The convention comes from NumPy, since DataFrames are just fancy index markers on top of a NumPy array.**

In [36]:
# As a proof of logic we can do the following
df.shape

(5, 4)

**Notice that df.shape is a tuple for a 2-D matrix: the number of rows is at index 0 and the number of columns is at index 1.**

**Therefore rows are axis = 0 and columns are axis = 1.**
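
The shared convention can be checked by running the same reduction on a NumPy array and on a DataFrame built from it. This is a small sketch with made-up integers; `arr` and `df_demo` are illustrative names.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3],
                [4, 5, 6]])  # 2 rows, 3 columns
df_demo = pd.DataFrame(arr, index=['A', 'B'], columns=['W', 'X', 'Y'])

print(arr.shape)      # (2, 3)
print(df_demo.shape)  # (2, 3)

# axis=0 runs down the rows, so a reduction leaves one value per column
print(arr.sum(axis=0).tolist())      # [5, 7, 9]
print(df_demo.sum(axis=0).tolist())  # [5, 7, 9]

# axis=1 runs across the columns, so a reduction leaves one value per row
print(arr.sum(axis=1).tolist())      # [6, 15]
print(df_demo.sum(axis=1).tolist())  # [6, 15]
```

For `drop`, the axis tells pandas where to look up the label: axis=0 searches the row index and axis=1 searches the column names.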

# Rows
* 1st method to grab a row: based on the index label
* loc[]

In [37]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [41]:
# There are multiple ways to select rows.
# 1. loc (location) -> Takes a label as input
df.loc['C'] # loc is an indexer rather than a regular method, so it takes square brackets instead of parentheses.


W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

**df.loc['row_you_want']**
* returns a Series.
* Therefore, not only columns but also rows are Series, and they are returned as Series when requested.

> 2nd method to grab a row: based on the index position instead of the label
# iloc
> Integer-location based indexing. Used to pass in a numerical index position, even if the axes are labelled by strings.

In [43]:
df.iloc[2] # Numerical based index.

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64
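
The distinction between `loc` and `iloc` matters most when the index labels are themselves integers. The sketch below uses two hypothetical frames, `df_demo` and `int_df`, with made-up values to show both cases.

```python
import pandas as pd

# String labels: label 'C' sits at position 2, so both calls return the same row
df_demo = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])
print(df_demo.loc['C'].equals(df_demo.iloc[2]))  # True

# Integer labels that deliberately do not match the positions
int_df = pd.DataFrame({'W': [10, 20, 30]}, index=[2, 0, 1])
print(int_df.loc[2, 'W'])   # 10 -> the row whose label is 2
print(int_df.iloc[2, 0])    # 30 -> the row at position 2 (the third row)
```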

In [44]:
# We can also select subsets of rows and columns, similar to NumPy indexing.
df.loc['B','Y']# df.loc['row_we_want','column_we_want']

-0.8480769834036315

In [46]:
# A to E rows with W & Y columns
df.loc[['A','B','C','D','E'],['W','Y']]

Unnamed: 0,W,Y
A,2.70685,0.907969
B,0.651118,-0.848077
C,-2.018168,0.528813
D,0.188695,-0.933237
E,0.190794,2.605967
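
Listing every row label works, but slices are usually shorter. Below is a sketch, using a hypothetical frame `df_demo` with made-up integers, of label slices with `loc` and position slices with `iloc`.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [1, 2, 3, 4, 5], 'X': [6, 7, 8, 9, 10], 'Y': [11, 12, 13, 14, 15]},
    index=['A', 'B', 'C', 'D', 'E'],
)

# Label slices include both endpoints
by_label = df_demo.loc['A':'C', ['W', 'Y']]

# Position slices exclude the stop position, like Python lists
by_position = df_demo.iloc[0:3, [0, 2]]

print(by_label.equals(by_position))  # True
print(list(by_label.index))          # ['A', 'B', 'C']

# A bare colon selects every row
all_rows = df_demo.loc[:, ['W', 'Y']]
print(all_rows.shape)                # (5, 2)
```

Note the asymmetry: `'A':'C'` with `loc` includes row C, while `0:3` with `iloc` stops before position 3.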


# DataFrames Pt. 2

# In Pt. 2 we cover the following : 
* Conditional Selection

In [47]:
#  We can perform conditional selection in Pandas using bracket notation.
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [51]:
booldf = df > 0 # Using a comparison operator against the DataFrame gives a DataFrame of boolean values.
# This is similar to what happens when you apply a comparison to a NumPy array.

In [52]:
booldf

Unnamed: 0,W,X,Y,Z
A,True,True,True,True
B,True,False,False,True
C,False,True,True,False
D,True,False,False,True
E,True,True,True,True


In [54]:
df[booldf] # We get the values where the condition was True and NaN (Not a Number) at all False locations.

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [59]:
# The usual way to do conditional selection is shown below. What we did above was split into steps for ease of understanding.
df[df>0]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509


> **But conditional selection on the entire DataFrame is also uncommon. More often, instead of passing a condition on the whole DataFrame, we pass a condition on a row or column; instead of returning NaN, it returns only the rows or columns of the DataFrame where the condition is true.**
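
The two behaviours can be compared side by side. This sketch uses a hypothetical frame `df_demo` with made-up values; `DataFrame.where` is included only to show that masking with a boolean DataFrame gives the same result as calling it.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [1.0, -2.0, 3.0], 'X': [-4.0, 5.0, 6.0]},
    index=['A', 'B', 'C'],
)

# Boolean DataFrame: the shape is preserved and failing cells become NaN
masked = df_demo[df_demo > 0]
print(masked.shape)                               # (3, 2)
print(int(masked.isna().sum().sum()))             # 2
print(masked.equals(df_demo.where(df_demo > 0)))  # True

# Boolean Series: whole rows are kept or dropped, so no NaN appears
filtered = df_demo[df_demo['W'] > 0]
print(list(filtered.index))                       # ['A', 'C']
print(int(filtered.isna().sum().sum()))           # 0
```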

In [61]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [62]:
df['W']>0

A     True
B     True
C    False
D     True
E     True
Name: W, dtype: bool

In [64]:
df['W'] # The value at index C is less than 0, so the comparison above returns False for it.

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [65]:
# Now we can use the Series of boolean values shown above, one per row, to filter rows
# based on a column's value.
df[df['W']>0] # Returns only the rows where the condition is true. We use this type of selection a lot!
# Because we are passing a Series, we no longer get null values.
# Null values only appear when you apply a condition to the entire DataFrame.

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [66]:
#  To grab all the rows in the dataframe where Z < 0
df[df['Z']<0]

Unnamed: 0,W,X,Y,Z
C,-2.018168,0.740122,0.528813,-0.589001


In [67]:
resultdf = df[df['W']>0] # Note that we get a DF in response. And this means we can call commands on this DF.
# We can do so in 1 or 2 steps.

In [71]:
resultdf

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [72]:
resultdf['X'] # Grabbing the X column from the resultdf DF where C is not present. We do this here in 2 steps.

A    0.628133
B   -0.319318
D   -0.758872
E    1.978757
Name: X, dtype: float64

In [None]:
# Doing it in one step looks like this:
df[df['W']>0]['X'] # df[df['W']>0] returns the rows where W > 0 (all rows except C); ['X'] then grabs the X column.
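
The filter-then-select pattern can also be written as a single `.loc` call, which takes the row mask and the column label together. The sketch below rebuilds the W and X columns from the output shown earlier in a stand-in frame named `df_demo`; the names `chained` and `single` are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [2.706850, 0.651118, -2.018168, 0.188695, 0.190794],
     'X': [0.628133, -0.319318, 0.740122, -0.758872, 1.978757]},
    index=['A', 'B', 'C', 'D', 'E'],
)

# Chained form: filter the rows, then pick a column from the result
chained = df_demo[df_demo['W'] > 0]['X']

# Single .loc call: row mask first, column label second
single = df_demo.loc[df_demo['W'] > 0, 'X']

print(chained.equals(single))  # True
print(list(single.index))      # ['A', 'B', 'D', 'E']
```

Both forms read the same values; the `.loc` form is generally the safer choice when the selection is later used for assignment.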