# DataFrames Pt. 1

> DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!

# In Pt. 1 we cover the following : 
* Create a basic DataFrame
* Indexing
* Selection
* Dropping rows and cols and importance of inplace parameter.
* Reasoning behind axis = 0 for rows and axis = 1 for columns.
* Accessing rows and cols in DataFrame

In [None]:
import numpy as np
import pandas as pd

In [None]:
from numpy.random import randn

In [None]:
# For having gridlines

In [None]:
%%HTML
<style type="text/css">
table.dataframe td, table.dataframe th {
    border: 1px black solid !important;
    color: black !important;
}
</style>

In [None]:
#  Setting a seed -> Seed makes sure that we get the same random numbers.
np.random.seed(101)

In [None]:
df = pd.DataFrame(randn(5,4), index=['A','B','C','D','E'], columns=['W','X','Y','Z'])

In [None]:
df # Shows columns W, X, Y, Z and rows A, B, C, D, E.
# Each column is a pandas Series; W, X, Y and Z are Series sharing a common index.
# That's what a DataFrame is: a bunch of Series that share an index.
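
The "bunch of Series sharing an index" idea can be sketched directly. This is a minimal illustration with made-up values; the names `w`, `x` and `frame` are assumptions for the example and are not part of the notebook above.

```python
import pandas as pd

index = ['A', 'B', 'C']
# Two Series built on the same index (values are illustrative)
w = pd.Series([1.0, 2.0, 3.0], index=index)
x = pd.Series([10.0, 20.0, 30.0], index=index)

# Combining them column-wise gives a DataFrame that shares that index
frame = pd.DataFrame({'W': w, 'X': x})

# Each column pulled back out is a Series carrying the same index
col_is_series = isinstance(frame['W'], pd.Series)
same_index = frame['W'].index.equals(frame.index)
```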

# Indexing and Selection

In [None]:
df['W'] # Grabs the W column, which is a Series. Prefer this bracket notation for grabbing a column.

In [None]:
type(df['W']) # Shows that it is a series

In [None]:
type(df)

In [None]:
df.W # If you know SQL, you often select a column as table.col_name, and that attribute style works here too.
# Not recommended: it can clash with DataFrame method names (e.g. a column called 'count').

In [None]:
# You can also pass in a list of columns
df[['W','Z']] # Asking for multiple columns returns a DataFrame, while a single column is just a Series.

In [None]:
df['new'] = df['W'] + df['Y']
# To create a new column, assign to it as if it already exists and build the right-hand side
# from other columns with arithmetic.

In [None]:
df

**To remove a column, use df.drop() and pass in the column name. We also need to specify axis = 1, because axis defaults to 0 (rows).**

In [None]:
df.drop('new',axis=1)

In [None]:
# df.drop() doesn't actually modify the DataFrame; it returns a new one, as we can see below on calling df.
df

In [None]:
# To actually remove the column 'new' we have to pass inplace=True.
# pandas requires this so that we do not accidentally lose data while dropping.
df.drop('new',axis=1,inplace=True)

In [None]:
df # Column 'new' permanently removed.

In [None]:
# df.drop() can also be used to drop rows.
df.drop('E',axis=0)
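
As a hedged aside, reassigning the result of `drop` is a common alternative to `inplace=True`, and the `columns=` / `index=` keywords can stand in for `axis=1` / `axis=0`. The small frame below is illustrative only and is not the notebook's `df`.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [5, 6]}, index=['A', 'B'])

dropped = frame.drop('new', axis=1)    # returns a new DataFrame
still_there = 'new' in frame.columns   # the original is untouched

# Reassigning is an alternative to inplace=True
frame = frame.drop(columns='new')      # columns= is equivalent to axis=1
without_b = frame.drop(index='B')      # index= is equivalent to axis=0
```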

**Another point of confusion is why rows have axis = 0 and columns have axis = 1.
The convention comes from NumPy, since DataFrames are just fancy index markers on top of a NumPy array.**

In [None]:
# As a proof of logic we can do the following
df.shape

**Notice that df.shape is a tuple for a 2-D matrix: the number of rows is at index 0 and the number of columns is at index 1.**

**Therefore rows are axis = 0 and columns are axis = 1.**
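
The link between `shape` positions and axis numbers can be checked on a tiny array. This is a sketch with illustrative data; `arr` and `frame` are names assumed for the example.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(2, 3)    # 2 rows, 3 columns
n_rows, n_cols = arr.shape          # shape[0] is rows, shape[1] is columns

frame = pd.DataFrame(arr, index=['A', 'B'], columns=['W', 'X', 'Y'])

# The DataFrame reports the same shape as the underlying array
same_shape = frame.shape == arr.shape

# drop follows the same numbering: axis=0 looks up row labels, axis=1 column labels
fewer_rows = frame.drop('A', axis=0)
fewer_cols = frame.drop('W', axis=1)
```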

# Rows
* 1st method to grab a row : Based on the label of index
* loc[]

In [None]:
df

In [None]:
# There are multiple ways to select rows, using indexers rather than method() calls.
# 1. loc (location) -> takes a label as input
df.loc['C'] # loc is an indexer, not a method, so it takes square brackets instead of parentheses.


**df.loc['row_you_want']**
* returns a series. 
* So not only are all columns Series; rows are Series too, and are returned as Series when requested.

> 2nd method to grab a row: based on the index position instead of the label
# iloc : 
> Integer-position-based location. Pass in a numerical index position, even if the axes are labelled by strings.
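
A small side-by-side of the two row lookups, assuming an illustrative frame whose third row is labelled 'C'. Both return the same row as a Series.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2, 3], 'X': [4, 5, 6]}, index=['A', 'B', 'C'])

by_label = frame.loc['C']       # row whose index label is 'C'
by_position = frame.iloc[2]     # third row, counting from 0

same_row = by_label.equals(by_position)
row_type = type(by_label).__name__   # name of the returned type
```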

In [None]:
df.iloc[2] # Position-based index: position 2 is row C.

In [None]:
# To select a subset of rows and columns, similar to NumPy:
df.loc['B','Y'] # df.loc['row_we_want','column_we_want']

In [None]:
# A to E rows with W & Y columns
df.loc[['A','B','C','D','E'],['W','Y']]
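
Listing every row label gets tedious, so here is a hedged sketch of the slice form. Slicing is not used in the cells above; the point of the example is that a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position. Data and names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {'W': [1, 2, 3, 4], 'X': [5, 6, 7, 8], 'Y': [9, 10, 11, 12]},
    index=['A', 'B', 'C', 'D'],
)

single = frame.loc['B', 'Y']              # one value: row B, column Y
block = frame.loc['A':'C', ['W', 'Y']]    # label slice includes the end label 'C'
by_pos = frame.iloc[0:3, [0, 2]]          # position slice excludes position 3
```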

# DataFrames Pt. 2

# In Pt. 2 we cover the following : 
* Conditional Selection
* Single line v/s Multiple line abstraction
* Using multiple conditions
* Reason and fix for ambiguous series error
* Modifying the index (Set Index v/s Reset Index)

In [None]:
#  We can perform conditional selection in Pandas using bracket notation.
df

In [None]:
booldf = df > 0 # Using a comparison operator against the DataFrame gives a DataFrame of boolean values.
# This is similar to what happens when you apply a comparison to a NumPy array.

In [None]:
booldf

In [None]:
df[booldf] # We get the values where the condition was True and NaN (Not a Number) at all False locations.

In [None]:
# The usual way to write this conditional selection is shown below; the two steps above were just for ease of understanding.
df[df>0]

> **Selecting on the entire DataFrame like this is uncommon, though. More often we pass a condition on a single column (or row), and instead of returning NaN, pandas returns only the rows (or columns) of the DataFrame where the condition is True.**
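
The difference between the two kinds of mask can be sketched on a tiny frame. Values are illustrative and `frame`, `masked` and `filtered` are names assumed for the example.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1.0, -2.0, 3.0], 'X': [-4.0, 5.0, 6.0]}, index=['A', 'B', 'C'])

masked = frame[frame > 0]           # same shape, NaN where the condition is False
filtered = frame[frame['W'] > 0]    # only rows where W > 0, no NaN introduced
```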

In [None]:
df

In [None]:
df['W']>0

In [None]:
df['W'] # The value at index C is less than 0, so the comparison above returns False for C.

In [None]:
# Now we can use the Series of boolean values shown above to keep only the rows
# that satisfy a condition on a column's value.
df[df['W']>0] # Returns only the rows where the condition is True. We use this type of selection a lot!
# Because we pass a boolean Series, whole rows are kept or dropped and no NaN values are introduced.
# NaN values only appear when the condition is applied to the entire DataFrame.

In [None]:
#  To grab all the rows in the dataframe where Z < 0
df[df['Z']<0]

In [None]:
resultdf = df[df['W']>0] # Note that we get a DF in response. And this means we can call commands on this DF.
# We can do so in 1 or 2 steps.

In [None]:
resultdf

In [None]:
resultdf['X'] # Grabbing the X column from the resultdf DF where C is not present. We do this here in 2 steps.

In [None]:
# Doing it in 1 step will look like what's described below :
df[df['W']>0]['X'] # Returns the DataFrame where W > 0, i.e. all rows except C, then stacks
# another [] bracket lookup on top of that to grab column X.

In [None]:
df[df['W']>0][['X','Y','Z']] #Since this is a dataframe we can bracket for multiple columns by passing in a list.

In [None]:
# Line-by-line version of the command above, for understanding:
boolser = df['W']>0

In [None]:
boolser

In [None]:
result = df[boolser]

In [None]:
result # Entire DataFrame without row C since it was False.

In [None]:
mycols = ['X','Y','Z']

In [None]:
result[mycols] #Only print the mycols columns from result DataFrame.

# Using multiple conditions

In [None]:
df[(df['W']>0) and (df['Y']>1)] # W > 0 and Y > 1
# Raises an error saying "truth value of a Series is ambiguous."
# Python's and operator needs a single True/False value from each operand, but each operand here is a whole Series of booleans.

In [None]:
# The and operator works on single boolean values only. For instance:
print(True and True)
print(False and True)

In [None]:
# When we pass an entire Series of boolean values such as
df['W']>0
# the and operator cannot reduce it to a single True or False, hence the error.

In [None]:
# The fix is to use & when working with pandas, wrapping each condition in parentheses.
df[(df['W']>0) & (df['Y']>1)]# W > 0 and Y>1 

### To avoid ambiguous series error : 
> * **Use & instead of and**
> * **Use | instead of or**
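
A compact sketch of the element-wise operators on an illustrative frame. The `~` (not) operator is an addition that the cells above do not use, and the `try`/`except` simply confirms that plain `and` raises a `ValueError`.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, -2, 3, -4], 'Y': [2, 2, 0, 0]}, index=['A', 'B', 'C', 'D'])

# Each condition is wrapped in parentheses before combining
both = frame[(frame['W'] > 0) & (frame['Y'] > 1)]      # W > 0 and Y > 1
either = frame[(frame['W'] > 0) | (frame['Y'] > 1)]    # W > 0 or Y > 1
negated = frame[~(frame['W'] > 0)]                     # not (W > 0)

# Plain `and` asks each Series for a single truth value, which it cannot give
try:
    frame[(frame['W'] > 0) and (frame['Y'] > 1)]
    raised = False
except ValueError:
    raised = True
```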


# Modifying the Index

In [None]:
df #Original DF

In [None]:
# To reset the index to the range 0...n-1
df.reset_index() # The old index becomes a column and the new index is numerical, so no data is lost.
# Again, keep in mind that this doesn't occur in place; calling df again shows the original, as above.

In [None]:
df

In [None]:
# To make the change permanent use inplace in the following manner : 
# df.reset_index(inplace=True)

In [None]:
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z']) # Recreate df with the string index (new random values, since the seed is not reset).

In [None]:
df

In [None]:
df.reset_index() # Old Index becomes a column of the dataframe.

In [None]:
# Setting the index

newind = 'CG MP UP TN OR'.split() # Creating a new index
# Calling .split() on a string splits it on whitespace, a quick way to create a list.

In [None]:
newind

In [None]:
# Adding the list above as a column of the DataFrame df
df['States'] = newind # The list length matches the number of rows, so it is added as a column.

In [None]:
df # We can see that a new column is added at the end of the df.

In [None]:
# To make the States column the index, use the set_index method
df.set_index('States') # The States column becomes the index.
# Note: unlike reset_index, set_index overwrites the old index instead of keeping it as a new column,
# so the old index information is lost unless you save it first.
# df itself is unchanged here, because inplace=False by default.

In [None]:
df
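
One way to keep the old labels when switching the index is to chain `reset_index` before `set_index`. This is a sketch on an illustrative frame, not the notebook's `df`; the column name `'index'` is what pandas assigns by default when the old index has no name.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])
frame['States'] = ['CG', 'MP', 'UP']

# reset_index first moves the old labels into a column named 'index',
# so they survive when 'States' becomes the new index
kept = frame.reset_index().set_index('States')

# set_index alone discards the old A/B/C labels
lost = frame.set_index('States')
```

Neither call modifies `frame`, since both return new DataFrames.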

# DataFrames Pt. 3

### In Pt. 3 we cover the following : 

* Multi-Index and Index Hierarchy
* Calling Data from Multi Level Index
* Cross-section xs
* Advanced review of multi-index topics and index hierarchy


> Let us go over how to work with a Multi-Index. First we'll create a quick example of what a Multi-Indexed DataFrame looks like:

In [None]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside)) # zip pairs the two lists; list() turns the result into a list of tuples
hier_index = pd.MultiIndex.from_tuples(hier_index) # Takes a list of tuples like the one below and
# creates a MultiIndex with several levels from it.

In [None]:
list(zip(outside,inside))

In [None]:
df = pd.DataFrame(randn(6,2),hier_index,['A','B']) # Makes a DF of 6 rows and 2 columns, 
#index equal to hier index and cols A,B

In [None]:
df # Gives a dataframe of 2 levels of index. Index1- G1,G2 and Index2 - 1,2,3 and cols A and B

In [None]:
df.loc['G1'] # Returns sub dataframe of everything inside G1

In [None]:
df.loc['G1'].loc[1] # Gives the row labelled 1 inside G1.
# The basic idea is to call from the outside index and keep going one level deeper.

In [None]:
# To name the index levels (G1/G2 and 1/2/3) we can do the following:
df.index.names # The output shows None for each level, i.e. the levels have no names yet

In [None]:
df.index.names = ['Groups','Num']

In [None]:
df

In [None]:
df.loc['G2'].loc[2]['B']

In [None]:
# To index G1 3 [A] -0.925874

df.loc['G1'].loc[3]['A']
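
As a hedged alternative to chaining one `.loc` per level, a tuple of labels can address both levels in a single lookup. The frame below uses illustrative integers instead of the random values above.

```python
import pandas as pd

hier_index = pd.MultiIndex.from_tuples(
    [('G1', 1), ('G1', 2), ('G2', 1), ('G2', 2)], names=['Groups', 'Num']
)
frame = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [10, 11, 12, 13]}, index=hier_index)

chained = frame.loc['G2'].loc[2]['B']    # step in one level at a time
direct = frame.loc[('G2', 2), 'B']       # same value in a single lookup
```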

In [None]:
df.xs('G1') # Returns a cross-section of rows or columns from a Series or DataFrame. Used with multi-level indexes.

In [None]:
df.loc['G1']

In [None]:
# What's nice about cross-section xs is that it can skip levels or go inside a multilevel index.
# Say we have the DataFrame df
df

* ***Aim: grab all the rows whose Num equals 1, in both G1 and G2.***
* ***This is hard to achieve with .loc, but easy with the xs method.***
* ***Pass the value you want (1) as the first argument, and the name of the index level as the level argument.***
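
For comparison, here is a sketch of the same inner-level selection done both ways on an illustrative frame. The `.loc` form with `slice(None)` is an assumption about how one might write it by hand; it works here because the index is sorted, and unlike `xs` it keeps both index levels.

```python
import pandas as pd

outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]
hier_index = pd.MultiIndex.from_tuples(list(zip(outside, inside)), names=['Groups', 'Num'])
frame = pd.DataFrame({'A': range(6), 'B': range(10, 16)}, index=hier_index)

# xs selects Num == 1 across every group and drops the 'Num' level
via_xs = frame.xs(1, level='Num')

# A loc equivalent needs a slice for the outer level and keeps both levels
via_loc = frame.loc[(slice(None), 1), :]
```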

In [None]:
df.xs(1,level='Num')

In [None]:
# We are able to grab a cross-section (xs) where level = 'Num' and the value is 1.

# Great Work! 