# DataFrames

A DataFrame is made up of pandas Series that share the same index.

In [1]:
import numpy as np
import pandas as pd

In [2]:
from numpy.random import randn

Set the seed so that anyone re-running the notebook gets the same random values as the outputs shown here.

In [3]:
np.random.seed(101)
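
As a small illustration of why the seed matters, the sketch below re-seeds the generator and draws again. The names `first` and `second` are only for this example and do not appear elsewhere in the notebook.

```python
import numpy as np
from numpy.random import randn

# Re-seeding with the same value restarts the generator at the same point,
# so the two draws below are identical.
np.random.seed(101)
first = randn(3)

np.random.seed(101)
second = randn(3)

print(np.array_equal(first, second))  # True
```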

## Creating DataFrames

In [4]:
df = pd.DataFrame(randn(5, 4), ['A', 'B', 'C', 'D', 'E'], columns=['w', 'x', 'y', 'z'])
df

Unnamed: 0,w,x,y,z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


### Using indexing to grab a Series from a DataFrame

Fetching the 'w' column:

In [5]:
df['w']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: w, dtype: float64

In [6]:
df['x']

A    0.628133
B   -0.319318
C    0.740122
D   -0.758872
E    1.978757
Name: x, dtype: float64

In [7]:
type(df['x'])

pandas.core.series.Series

In [8]:
type(df)

pandas.core.frame.DataFrame
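
To make the "Series that share the same index" idea concrete, here is a minimal sketch that builds a frame from two Series. The names `idx`, `w`, `x`, and `frame` and their values are made up for illustration.

```python
import pandas as pd

idx = ['A', 'B', 'C']
w = pd.Series([1.0, 2.0, 3.0], index=idx, name='w')
x = pd.Series([10.0, 20.0, 30.0], index=idx, name='x')

# A dict of Series becomes a DataFrame with one column per Series.
frame = pd.DataFrame({'w': w, 'x': x})

# Each column comes back as a Series carrying the frame's index.
print(type(frame['w']))
print(frame['w'].index.equals(frame.index))  # True
```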

In [9]:
df[['w','z']]

Unnamed: 0,w,z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


The `.` operator followed by the column name (attribute access) can also be used to extract a column, but only if the column name contains no spaces and does not conflict with an existing DataFrame attribute or method.

In [10]:
df.w

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: w, dtype: float64

In [11]:
df.x

A    0.628133
B   -0.319318
C    0.740122
D   -0.758872
E    1.978757
Name: x, dtype: float64

In [12]:
df.y

A    0.907969
B   -0.848077
C    0.528813
D   -0.933237
E    2.605967
Name: y, dtype: float64

In [13]:
df.z

A    0.503826
B    0.605965
C   -0.589001
D    0.955057
E    0.683509
Name: z, dtype: float64
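
The caveat about attribute access can be seen in a small sketch. The frame `demo` and its column names `count` and `unit price` are hypothetical, chosen only because one collides with a DataFrame method and the other contains a space.

```python
import pandas as pd

demo = pd.DataFrame({'count': [1, 2, 3], 'unit price': [9.5, 3.0, 4.25]})

# 'count' collides with DataFrame.count, so attribute access returns the method.
print(callable(demo.count))          # True: this is the method, not the column

# Bracket indexing always reaches the column, even when the name has a space.
print(demo['count'].tolist())        # [1, 2, 3]
print(demo['unit price'].tolist())   # [9.5, 3.0, 4.25]
```

For this reason bracket indexing is the safer default when column names are not under your control.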

---

## Indexing rows using `loc` and `iloc`
1. df.loc - index with the labels of rows and columns
2. df.iloc - index with the integer positions of rows and columns
> These are indexer attributes rather than methods, so they take **`[  ]`** instead of parentheses

In [14]:
df.loc['A',:]

w    2.706850
x    0.628133
y    0.907969
z    0.503826
Name: A, dtype: float64

In [15]:
df.loc[['A', 'C', 'E'], :]

Unnamed: 0,w,x,y,z
A,2.70685,0.628133,0.907969,0.503826
C,-2.018168,0.740122,0.528813,-0.589001
E,0.190794,1.978757,2.605967,0.683509


In [16]:
df

Unnamed: 0,w,x,y,z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


**using `df.iloc[row_index, column_index]`**

In [17]:
df.iloc[2:, 2:]

Unnamed: 0,y,z
C,0.528813,-0.589001
D,-0.933237,0.955057
E,2.605967,0.683509


In [18]:
df.iloc[[0,2,4], [1,3]]

Unnamed: 0,x,z
A,0.628133,0.503826
C,0.740122,-0.589001
E,1.978757,0.683509


The same selection can be written with `df.loc` by using labels instead of positions:

In [19]:
df.loc[['A','C','E'], ['x','z']]

Unnamed: 0,x,z
A,0.628133,0.503826
C,0.740122,-0.589001
E,1.978757,0.683509


Both approaches are efficient; which one to use depends on readability and personal choice, but **`df.iloc` comes in handy when the column and row names are long**
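
One difference worth keeping in mind when switching between the two is how slices behave. The sketch below uses a hypothetical frame `demo` filled with consecutive integers; label slices with `loc` include the end label, while position slices with `iloc` exclude the end position.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(20).reshape(5, 4),
                    index=['A', 'B', 'C', 'D', 'E'],
                    columns=['w', 'x', 'y', 'z'])

by_label = demo.loc['A':'C', 'w':'x']   # label slices include the end label
by_position = demo.iloc[0:3, 0:2]       # position slices exclude the end

print(by_label.equals(by_position))     # True
```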

---

## Adding new Columns to DataFrame

A new column can be added with the assignment operator by providing a label and the values.

In [20]:
df['new'] = df['x'] + df['z']
df['new']

A    1.131958
B    0.286647
C    0.151122
D    0.196184
E    2.662266
Name: new, dtype: float64

In [21]:
df

Unnamed: 0,w,x,y,z,new
A,2.70685,0.628133,0.907969,0.503826,1.131958
B,0.651118,-0.319318,-0.848077,0.605965,0.286647
C,-2.018168,0.740122,0.528813,-0.589001,0.151122
D,0.188695,-0.758872,-0.933237,0.955057,0.196184
E,0.190794,1.978757,2.605967,0.683509,2.662266


### Dropping columns

To delete a column, we can use **`DataFrame.drop(column_name, axis=1)`**
- axis = 0 - for rows
- axis = 1 - for columns

In [22]:
df.drop('new', axis=1)

Unnamed: 0,w,x,y,z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


This does not affect the original DataFrame; to change it we have to provide one more argument, **`inplace=True`**.
- When `inplace=True` is used, the call returns `None` but changes the original DataFrame

In [23]:
df

Unnamed: 0,w,x,y,z,new
A,2.70685,0.628133,0.907969,0.503826,1.131958
B,0.651118,-0.319318,-0.848077,0.605965,0.286647
C,-2.018168,0.740122,0.528813,-0.589001,0.151122
D,0.188695,-0.758872,-0.933237,0.955057,0.196184
E,0.190794,1.978757,2.605967,0.683509,2.662266


In [24]:
# this will drop the 'new' column from the DataFrame
df.drop('new', axis=1, inplace=True)

In [25]:
df

Unnamed: 0,w,x,y,z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
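
The difference between the two calls can be checked directly. This is a minimal sketch on a hypothetical frame `demo`; the names `trimmed` and `result` are only for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'w': [1, 2], 'x': [3, 4], 'new': [4, 6]})

# drop returns a new DataFrame; the original keeps its 'new' column.
trimmed = demo.drop('new', axis=1)
print(list(demo.columns))     # ['w', 'x', 'new']
print(list(trimmed.columns))  # ['w', 'x']

# With inplace=True the call returns None and modifies demo itself.
result = demo.drop('new', axis=1, inplace=True)
print(result)                 # None
print(list(demo.columns))     # ['w', 'x']
```

Reassigning the returned frame (`demo = demo.drop('new', axis=1)`) has the same end result as `inplace=True`.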


### Dropping rows

Rows are dropped by index label; `axis=0` is the default, so it can be omitted.

In [26]:
df.drop('E')

Unnamed: 0,w,x,y,z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


In [27]:
df

Unnamed: 0,w,x,y,z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [28]:
df.drop('E', inplace=True)

In [29]:
df

Unnamed: 0,w,x,y,z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


In [30]:
df.shape

(4, 4)

In [31]:
df.dtypes

w    float64
x    float64
y    float64
z    float64
dtype: object

---

# Conditional Selection

- Conditional selection works much like boolean masking on NumPy arrays
- A comparison operator evaluates a condition and returns a boolean for each value in the DataFrame
- These boolean values can then be used to select the values that pass the criteria

In [32]:
df = pd.DataFrame(randn(5, 4), ['A', 'B', 'C', 'D', 'E'], columns=['w', 'x', 'y', 'z'])
df

Unnamed: 0,w,x,y,z
A,0.302665,1.693723,-1.706086,-1.159119
B,-0.134841,0.390528,0.166905,0.184502
C,0.807706,0.07296,0.638787,0.329646
D,-0.497104,-0.75407,-0.943406,0.484752
E,-0.116773,1.901755,0.238127,1.996652


In [33]:
df > 0

Unnamed: 0,w,x,y,z
A,True,True,False,False
B,False,True,True,True
C,True,True,True,True
D,False,False,False,True
E,False,True,True,True


In [34]:
df[df>0]

Unnamed: 0,w,x,y,z
A,0.302665,1.693723,,
B,,0.390528,0.166905,0.184502
C,0.807706,0.07296,0.638787,0.329646
D,,,,0.484752
E,,1.901755,0.238127,1.996652


In practice we rarely apply a condition to the whole DataFrame; instead we apply it to a single column (a Series) to see which rows pass our criteria

In [35]:
df['y']>0

A    False
B     True
C     True
D    False
E     True
Name: y, dtype: bool

In [36]:
# passing the boolean Series to keep only the rows where it is True
df[df['y']>0]

Unnamed: 0,w,x,y,z
B,-0.134841,0.390528,0.166905,0.184502
C,0.807706,0.07296,0.638787,0.329646
E,-0.116773,1.901755,0.238127,1.996652


Only the rows where the condition is True are returned (B, C, and E here), and the result is a DataFrame, not a Series

In [37]:
df['x']>0

A     True
B     True
C     True
D    False
E     True
Name: x, dtype: bool

In [38]:
# passing the values inside the df
df[df['x']>0]

Unnamed: 0,w,x,y,z
A,0.302665,1.693723,-1.706086,-1.159119
B,-0.134841,0.390528,0.166905,0.184502
C,0.807706,0.07296,0.638787,0.329646
E,-0.116773,1.901755,0.238127,1.996652


#### Selecting a Series (column) after filtering

In [39]:
df[df['z']>0]

Unnamed: 0,w,x,y,z
B,-0.134841,0.390528,0.166905,0.184502
C,0.807706,0.07296,0.638787,0.329646
D,-0.497104,-0.75407,-0.943406,0.484752
E,-0.116773,1.901755,0.238127,1.996652


Using the output to choose only column 'x'

In [40]:
df[df['z']>0]['x']

B    0.390528
C    0.072960
D   -0.754070
E    1.901755
Name: x, dtype: float64

Or we can choose multiple columns

In [41]:
df[df['z']>0][['x','y']]

Unnamed: 0,x,y
B,0.390528,0.166905
C,0.07296,0.638787
D,-0.75407,-0.943406
E,1.901755,0.238127
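
The filter-then-select pattern above can also be written as a single `loc` call that takes the boolean mask for rows and a list of labels for columns. The sketch below uses a hypothetical frame `demo` with made-up values to show that both forms give the same result.

```python
import pandas as pd

demo = pd.DataFrame({'x': [1.0, -2.0, 3.0],
                     'y': [0.5, 0.1, -0.4],
                     'z': [-1.0, 2.0, 4.0]},
                    index=['A', 'B', 'C'])

# Two steps: filter rows, then pick columns from the filtered frame.
chained = demo[demo['z'] > 0][['x', 'y']]

# One step: row mask and column labels in the same loc call.
single = demo.loc[demo['z'] > 0, ['x', 'y']]

print(chained.equals(single))  # True
```

The single-step form is generally preferable when the selection is later used for assignment, since assigning through a chained selection may not modify the original frame.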


### Having multiple conditions in selection

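Instead of Python's `and` we have to use ` & `, because **the `and` operator cannot handle a Series**: it expects a single True/False value, not one per row. Each condition must be wrapped in parentheses.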

In [42]:
df[(df['z']>0) & (df['x']<0) & (df['y']<0)][['z','y','x']]

Unnamed: 0,z,y,x
D,0.484752,-0.943406,-0.75407


Similarly, the `or` operator cannot be used with a pandas Series; we have to use **` |  `**

In [43]:
df[(df['x']>0) | (df['y']>0)]

Unnamed: 0,w,x,y,z
A,0.302665,1.693723,-1.706086,-1.159119
B,-0.134841,0.390528,0.166905,0.184502
C,0.807706,0.07296,0.638787,0.329646
E,-0.116773,1.901755,0.238127,1.996652
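
The sketch below, on a hypothetical frame `demo`, shows what happens when `and` is used with two boolean Series, followed by the element-wise `&` and `|` forms. The flag `raised` exists only to record that the error occurred.

```python
import pandas as pd

demo = pd.DataFrame({'x': [1, -1, 2], 'y': [-1, -2, 3]}, index=['A', 'B', 'C'])

raised = False
try:
    # 'and' asks each Series for a single truth value, which pandas refuses.
    demo[(demo['x'] > 0) and (demo['y'] > 0)]
except ValueError as err:
    raised = True
    print(err)

# Element-wise operators combine the masks row by row.
both = demo[(demo['x'] > 0) & (demo['y'] > 0)]
either = demo[(demo['x'] > 0) | (demo['y'] > 0)]

print(list(both.index))    # ['C']
print(list(either.index))  # ['A', 'C']
```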


---

## Indexing for DataFrames

To replace the index with the default integers 0 to n-1 we can use the method **`DataFrame.reset_index()`**
- this creates a new column "index" containing the values of the previous index and replaces the index with integers
- it returns a new DataFrame, so we have to use `inplace=True` (or reassign the result) for the change to persist

In [44]:
df

Unnamed: 0,w,x,y,z
A,0.302665,1.693723,-1.706086,-1.159119
B,-0.134841,0.390528,0.166905,0.184502
C,0.807706,0.07296,0.638787,0.329646
D,-0.497104,-0.75407,-0.943406,0.484752
E,-0.116773,1.901755,0.238127,1.996652


In [45]:
df.reset_index()

Unnamed: 0,index,w,x,y,z
0,A,0.302665,1.693723,-1.706086,-1.159119
1,B,-0.134841,0.390528,0.166905,0.184502
2,C,0.807706,0.07296,0.638787,0.329646
3,D,-0.497104,-0.75407,-0.943406,0.484752
4,E,-0.116773,1.901755,0.238127,1.996652
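
A short sketch of the behaviour described above, on a hypothetical frame `demo`. It also uses `drop=True`, a `reset_index` parameter not shown elsewhere in this notebook, which discards the old index instead of turning it into a column.

```python
import pandas as pd

demo = pd.DataFrame({'w': [1, 2, 3]}, index=['A', 'B', 'C'])

moved = demo.reset_index()             # old index becomes an 'index' column
dropped = demo.reset_index(drop=True)  # old index is discarded

print(list(moved.columns))    # ['index', 'w']
print(list(dropped.columns))  # ['w']
print(list(demo.index))       # ['A', 'B', 'C'], demo itself is unchanged
```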


### Creating a new index

If we have a column that we want to use as the index, we can use **`df.set_index(column_name)`**

In [46]:
new_index = "CA NY WY OR CO".split()
new_index

['CA', 'NY', 'WY', 'OR', 'CO']

In [47]:
df['states'] = new_index
df

Unnamed: 0,w,x,y,z,states
A,0.302665,1.693723,-1.706086,-1.159119,CA
B,-0.134841,0.390528,0.166905,0.184502,NY
C,0.807706,0.07296,0.638787,0.329646,WY
D,-0.497104,-0.75407,-0.943406,0.484752,OR
E,-0.116773,1.901755,0.238127,1.996652,CO


In [48]:
df.set_index('states')

states,w,x,y,z
CA,0.302665,1.693723,-1.706086,-1.159119
NY,-0.134841,0.390528,0.166905,0.184502
WY,0.807706,0.07296,0.638787,0.329646
OR,-0.497104,-0.75407,-0.943406,0.484752
CO,-0.116773,1.901755,0.238127,1.996652


If we also want to keep it as a column, we can use `drop=False`

In [49]:
df.set_index('states', drop=False)

states,w,x,y,z,states
CA,0.302665,1.693723,-1.706086,-1.159119,CA
NY,-0.134841,0.390528,0.166905,0.184502,NY
WY,0.807706,0.07296,0.638787,0.329646,WY
OR,-0.497104,-0.75407,-0.943406,0.484752,OR
CO,-0.116773,1.901755,0.238127,1.996652,CO


If we want to append it to the existing index, we can use `append=True`

In [50]:
df.set_index('states', append=True)

Unnamed: 0,states,w,x,y,z
A,CA,0.302665,1.693723,-1.706086,-1.159119
B,NY,-0.134841,0.390528,0.166905,0.184502
C,WY,0.807706,0.07296,0.638787,0.329646
D,OR,-0.497104,-0.75407,-0.943406,0.484752
E,CO,-0.116773,1.901755,0.238127,1.996652


---

# Multi-Index

A MultiIndex is a hierarchical index, where the index has more than one level

In [51]:
lst1 = ['A1','A1', 'A1', 'B1', 'B1','B1']
lst2 = [1, 2, 3, 1, 2, 3]
index = list(zip(lst1, lst2))
index

[('A1', 1), ('A1', 2), ('A1', 3), ('B1', 1), ('B1', 2), ('B1', 3)]

In [52]:
indx = pd.MultiIndex.from_tuples(index)
indx

MultiIndex(levels=[['A1', 'B1'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

In [53]:
df = pd.DataFrame(randn(6,2), indx,['A', 'B'])
df

Unnamed: 0,Unnamed: 1,A,B
A1,1,-0.993263,0.1968
A1,2,-1.136645,0.000366
A1,3,1.025984,-0.156598
B1,1,-0.031579,0.649826
B1,2,2.154846,-0.610259
B1,3,-0.755325,-0.346419


To select from a MultiIndex we can use **`df.loc[first_level_index].loc[second_level_index, columns]`**

In [54]:
df.loc['A1']

Unnamed: 0,A,B
1,-0.993263,0.1968
2,-1.136645,0.000366
3,1.025984,-0.156598


In [55]:
df.loc['A1'].loc[[1,2],:]

Unnamed: 0,A,B
1,-0.993263,0.1968
2,-1.136645,0.000366


If we want to use `iloc` on the inner level, we can do it in the following way

In [56]:
df.loc['B1'].iloc[[0,1],:]

Unnamed: 0,A,B
1,-0.031579,0.649826
2,2.154846,-0.610259
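
As an alternative to chaining two `.loc` calls, a tuple can address both levels in one call. This is a minimal sketch on a hypothetical frame `demo` with the same index layout as above but integer values, so the results are easy to check by eye.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('A1', 1), ('A1', 2), ('A1', 3), ('B1', 1), ('B1', 2), ('B1', 3)])
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=['A', 'B'])

# Chained form: pick the outer level, then the inner labels.
chained = demo.loc['A1'].loc[[1, 2], :]

# Tuple form: (outer label, inner labels) in a single loc call.
single = demo.loc[('A1', [1, 2]), :]

print(chained['A'].tolist())     # [0, 2]
print(single['A'].tolist())      # [0, 2]

# A full (outer, inner) tuple plus a column label returns one value.
print(demo.loc[('B1', 2), 'A'])  # 8
```

Note that the tuple form keeps both index levels in the result, while the chained form drops the outer level.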


**If we want to find the names of the index levels**

In [57]:
df.index.names

FrozenList([None, None])

In [58]:
df.index.names = ['Outer', 'inner']

In [59]:
df

Outer,inner,A,B
A1,1,-0.993263,0.1968
A1,2,-1.136645,0.000366
A1,3,1.025984,-0.156598
B1,1,-0.031579,0.649826
B1,2,2.154846,-0.610259
B1,3,-0.755325,-0.346419


### Cross-section

In [60]:
df.xs('B1')

inner,A,B
1,-0.031579,0.649826
2,2.154846,-0.610259
3,-0.755325,-0.346419


To fetch inner value 1 for both 'A1' and 'B1', we pass the `level` argument

In [61]:
df.xs(1, level='inner')

Outer,A,B
A1,-0.993263,0.1968
B1,-0.031579,0.649826
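
By default `xs` drops the level that was selected on. The sketch below, on a hypothetical frame `demo` with integer values, compares that default with `drop_level=False`, which keeps both levels in the result. It builds the index with `pd.MultiIndex.from_product`, an alternative to `from_tuples` that produces the same six entries here.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['A1', 'B1'], [1, 2, 3]],
                                 names=['Outer', 'inner'])
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=['A', 'B'])

# Default: the 'inner' level is removed from the result's index.
inner_one = demo.xs(1, level='inner')
print(list(inner_one.index))   # ['A1', 'B1']

# drop_level=False keeps the full (Outer, inner) index.
kept = demo.xs(1, level='inner', drop_level=False)
print(list(kept.index))        # [('A1', 1), ('B1', 1)]
```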
