# Data Frames : Part 1

### The Basics.

In [1]:
import numpy as np
import pandas as pd
from numpy.random import randn

In [2]:
np.random.seed(10)

In [3]:
df = pd.DataFrame(randn(5,4), ['A', 'B', 'C', 'D', 'E'], ['W','X', 'Y','Z'])
df

Unnamed: 0,W,X,Y,Z
A,1.331587,0.715279,-1.5454,-0.008384
B,0.621336,-0.720086,0.265512,0.108549
C,0.004291,-0.1746,0.433026,1.203037
D,-0.965066,1.028274,0.22863,0.445138
E,-1.136602,0.135137,1.484537,-1.079805


A data frame is a collection of Series that share a common index: each column is a Series, and so is each row you select.

In [4]:
type(df['W'])

pandas.core.series.Series

In [5]:
df['W']

A    1.331587
B    0.621336
C    0.004291
D   -0.965066
E   -1.136602
Name: W, dtype: float64

In [6]:
df[['W', 'Z']] 

Unnamed: 0,W,Z
A,1.331587,-0.008384
B,0.621336,0.108549
C,0.004291,1.203037
D,-0.965066,0.445138
E,-1.136602,-1.079805


In [7]:
df['new'] = df['W'] + df['Y']
df['new']

A   -0.213814
B    0.886848
C    0.437318
D   -0.736436
E    0.347935
Name: new, dtype: float64
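The column addition above is done element by element, matched on the row labels. As a minimal sketch (using a small hand-made frame named `demo`, which is an assumption for illustration and not the notebook's random data), `assign()` builds the same kind of column without modifying the original frame:

```python
import pandas as pd

# Hypothetical two-row frame with easy-to-check values.
demo = pd.DataFrame({'W': [1.0, 2.0], 'Y': [10.0, 20.0]}, index=['A', 'B'])

# Series addition is element-wise and aligned on the row labels.
total = demo['W'] + demo['Y']

# assign() returns a new frame with the extra column; demo is left as it was.
with_total = demo.assign(new=total)

print(with_total)
print(list(demo.columns))
```

Assigning with `df['new'] = ...` changes `df` directly, while `assign()` is handy when you want to keep the original frame intact.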

If you want to delete a column, use the `.drop()` method. `.drop()` can remove either a row or a column. It drops rows by default (`axis = 0`), so to drop a column you must pass `axis = 1`. It also returns a new data frame and leaves the original untouched unless you set `inplace = True` (or assign the result back to `df`).

`df.drop('new', axis = 1, inplace = True)`



In [8]:
df.drop('new', axis = 1, inplace = True)

As we can see below, the `new` column has been deleted from the table.

In [9]:
df

Unnamed: 0,W,X,Y,Z
A,1.331587,0.715279,-1.5454,-0.008384
B,0.621336,-0.720086,0.265512,0.108549
C,0.004291,-0.1746,0.433026,1.203037
D,-0.965066,1.028274,0.22863,0.445138
E,-1.136602,0.135137,1.484537,-1.079805


In [10]:
df.drop('E')
#This drops a row (axis = 0 is the default). Without inplace = True, df itself is unchanged.

Unnamed: 0,W,X,Y,Z
A,1.331587,0.715279,-1.5454,-0.008384
B,0.621336,-0.720086,0.265512,0.108549
C,0.004291,-0.1746,0.433026,1.203037
D,-0.965066,1.028274,0.22863,0.445138


In [11]:
df.shape

(5, 4)
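The shape is still `(5, 4)` because `df.drop('E')` returned a new data frame and left `df` alone. The sketch below, on a small hypothetical frame named `demo`, shows the row and column forms side by side; it is one way to illustrate the behaviour, not the only way to call `drop`.

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
demo = pd.DataFrame({'W': [1, 2, 3], 'X': [4, 5, 6]}, index=['A', 'B', 'C'])

without_row = demo.drop('C')           # axis=0 is the default: drops a row
without_col = demo.drop('X', axis=1)   # axis=1 drops a column
same_col = demo.drop(columns='X')      # keyword form of the column drop

print(demo.shape)          # the original frame is untouched
print(without_row.shape)
print(without_col.shape)
```

Assigning the result to a new name, as here, is often easier to follow than `inplace = True`, because the original frame stays available.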

As we saw earlier, `df['W']` selects a single column, and passing a list of column names, as in `df[['W', 'Z']]`, selects several columns.

To select rows, however, we use `df.loc['A']` or `df.iloc[0]`.

The first method locates the row by its label, and the second locates the row by its integer position.

In [12]:
df.loc['A']

W    1.331587
X    0.715279
Y   -1.545400
Z   -0.008384
Name: A, dtype: float64

In [13]:
df.iloc[0]

W    1.331587
X    0.715279
Y   -1.545400
Z   -0.008384
Name: A, dtype: float64

In [14]:
df.loc['B','Y']

0.2655115856921195

To select multiple data points at once, pass `.loc` a list of row labels and a list of column labels.

In [15]:
df.loc[['A', 'B'], ['W', 'Y']]

Unnamed: 0,W,Y
A,1.331587,-1.5454
B,0.621336,0.265512
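Besides lists, `.loc` and `.iloc` also accept slices. The sketch below uses a small hypothetical frame named `demo` to show one difference that is easy to miss: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position.

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
demo = pd.DataFrame({'W': [1, 2, 3, 4], 'Y': [5, 6, 7, 8]},
                    index=['A', 'B', 'C', 'D'])

# Label-based slice: both 'A' and 'C' are included.
by_label = demo.loc['A':'C', 'W']

# Position-based slice: positions 0 and 1 only, position 2 is excluded.
by_position = demo.iloc[0:2, 0]

print(by_label)
print(by_position)
```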


---

# Data Frames : Part 2


### Conditional Selection and Resetting or Setting the Index

In [16]:
booldf = df > 0

#This will return a boolean value indicating whether a cell or data point is greater than zero.

booldf

Unnamed: 0,W,X,Y,Z
A,True,True,False,False
B,True,False,True,True
C,True,False,True,True
D,False,True,True,True
E,False,True,True,False


If we pass the `booldf` variable to our original data frame, we get back the values where the data point is greater than zero and null values (`NaN`) where it is not.

You can also pass the `df > 0` expression to the data frame directly and get the same result.

In [17]:

df[booldf]


Unnamed: 0,W,X,Y,Z
A,1.331587,0.715279,,
B,0.621336,,0.265512,0.108549
C,0.004291,,0.433026,1.203037
D,,1.028274,0.22863,0.445138
E,,0.135137,1.484537,
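As a quick check of the claim that the stored mask and the inline expression behave the same, here is a minimal sketch on a hypothetical two-row frame named `demo`. It also compares against `DataFrame.where`, which keeps values where the condition holds and fills the rest with `NaN`.

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
demo = pd.DataFrame({'W': [1.5, -2.0], 'X': [-0.5, 3.0]}, index=['A', 'B'])

mask = demo > 0              # data frame of True/False values
via_mask = demo[mask]        # stored mask
via_expr = demo[demo > 0]    # inline expression
via_where = demo.where(demo > 0)

print(via_expr)
```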


You can apply the same comparison to a single column to find the rows where that column is greater than zero.

In [18]:
df['W'] > 0

A     True
B     True
C     True
D    False
E    False
Name: W, dtype: bool

Thus we can filter out the rows where `W` is not greater than zero (that is, the rows with a boolean value of `False`) by passing this condition to the original data frame.

In [19]:
df[df['W']>0]

Unnamed: 0,W,X,Y,Z
A,1.331587,0.715279,-1.5454,-0.008384
B,0.621336,-0.720086,0.265512,0.108549
C,0.004291,-0.1746,0.433026,1.203037


In [20]:
df[df['Z']<0]

Unnamed: 0,W,X,Y,Z
A,1.331587,0.715279,-1.5454,-0.008384
E,-1.136602,0.135137,1.484537,-1.079805


In [21]:
df[(df['W']>0) & (df['Y'] > 0)] 
#We use the & symbol for the AND operation because Python's `and` cannot combine two boolean Series element by element (it raises a ValueError).
#For OR we use the | symbol. Each condition needs its own parentheses.

Unnamed: 0,W,X,Y,Z
B,0.621336,-0.720086,0.265512,0.108549
C,0.004291,-0.1746,0.433026,1.203037
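The sketch below, on a hypothetical frame named `demo`, shows the OR form next to the AND form, and what happens if the Python keyword `and` is used instead of `&`.

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
demo = pd.DataFrame({'W': [1.0, -1.0, 2.0], 'Y': [-3.0, -4.0, 5.0]},
                    index=['A', 'B', 'C'])

# | keeps rows where at least one condition holds.
either = demo[(demo['W'] > 0) | (demo['Y'] > 0)]

# & keeps rows where both conditions hold.
both = demo[(demo['W'] > 0) & (demo['Y'] > 0)]

# The keyword `and` asks for a single True/False from a whole Series,
# which pandas refuses to guess, so it raises ValueError.
try:
    demo[(demo['W'] > 0) and (demo['Y'] > 0)]
    and_failed = False
except ValueError:
    and_failed = True

print(either)
print(both)
print(and_failed)
```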


In the next portion we will discuss how to reset the index or set it to something else.

The `reset_index()` method resets the index to the default integer index (0, 1, 2, ...) and moves the old index into a column named `index`. It returns a new data frame, so `df` itself does not change unless we pass `inplace = True` or assign the result back to `df`.

In [22]:
df.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,1.331587,0.715279,-1.5454,-0.008384
1,B,0.621336,-0.720086,0.265512,0.108549
2,C,0.004291,-0.1746,0.433026,1.203037
3,D,-0.965066,1.028274,0.22863,0.445138
4,E,-1.136602,0.135137,1.484537,-1.079805
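A minimal sketch of this behaviour on a hypothetical frame named `demo`: the call returns a new frame, and the `drop=True` option discards the old labels instead of keeping them as a column.

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
demo = pd.DataFrame({'W': [1, 2]}, index=['A', 'B'])

reset = demo.reset_index()             # old labels become a column named 'index'
dropped = demo.reset_index(drop=True)  # old labels are discarded

print(list(demo.index))      # demo still has its original labels
print(list(reset.columns))
print(list(dropped.columns))
```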


In [23]:
new_ind = 'CA NY WY OR CO'.split()
new_ind

['CA', 'NY', 'WY', 'OR', 'CO']

In [24]:
df['States'] = new_ind
df

Unnamed: 0,W,X,Y,Z,States
A,1.331587,0.715279,-1.5454,-0.008384,CA
B,0.621336,-0.720086,0.265512,0.108549,NY
C,0.004291,-0.1746,0.433026,1.203037,WY
D,-0.965066,1.028274,0.22863,0.445138,OR
E,-1.136602,0.135137,1.484537,-1.079805,CO


The reverse is also possible: we can set a new index. In the code blocks above, we created a list named `new_ind` and added it to the data frame as the **States** column; below, we set that column as the index.

Again, `set_index()` returns a new data frame. To change `df` itself, assign the result back to `df` or pass `inplace = True`.

In [25]:
df.set_index('States')

Unnamed: 0_level_0,W,X,Y,Z
States,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CA,1.331587,0.715279,-1.5454,-0.008384
NY,0.621336,-0.720086,0.265512,0.108549
WY,0.004291,-0.1746,0.433026,1.203037
OR,-0.965066,1.028274,0.22863,0.445138
CO,-1.136602,0.135137,1.484537,-1.079805
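One detail worth noting, sketched here on a hypothetical frame named `demo`: `set_index()` moves the chosen column into the index and discards the previous index labels rather than keeping them as a column.

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
demo = pd.DataFrame({'W': [1, 2], 'States': ['CA', 'NY']}, index=['A', 'B'])

by_state = demo.set_index('States')

print(list(demo.index))         # original frame keeps its A, B labels
print(list(by_state.index))     # new frame is indexed by state
print(list(by_state.columns))   # 'States' is no longer a regular column
```

If the old labels matter, calling `reset_index()` first keeps them as a column before the new index is set.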


---


# Data Frames : Part 3

### Multi-index and Index Hierarchy

In [26]:
outside = ['G1', 'G1', 'G1','G2' ,'G2', 'G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside, inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)

Below we build a data frame with a multi-level index, otherwise known as an index hierarchy.



In [34]:
df = pd.DataFrame(randn(6,2), hier_index, ['A','B'])
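The index above is built from a list of tuples. When the index is a full grid of outer and inner labels, as it is here, `pd.MultiIndex.from_product` is a shorter alternative. This is a sketch of that alternative, not the approach used in the notebook.

```python
import pandas as pd

outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]

# The notebook's approach: pair the labels up and build from tuples.
from_tuples = pd.MultiIndex.from_tuples(list(zip(outside, inside)))

# Alternative: give each level once and let pandas form every combination.
from_product = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])

print(from_product)
print(from_tuples.equals(from_product))
```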

We can use `df.loc[]` with a label from the outside index (for example, `G1`) to get the sub data frame for `G1`.

Furthermore, calling `.loc[]` or `.iloc[]` on that sub data frame selects by the inside index (for example, `1`).

This process can be repeated once for each level of the index.

In [40]:
df.loc['G1'].iloc[1]

A    1.308473
B    0.195013
Name: 2, dtype: float64

In [43]:
df

Unnamed: 0,Unnamed: 1,A,B
,,,
G1,1.0,0.132708,-0.476142
G1,2.0,1.308473,0.195013
G1,3.0,0.40021,-0.337632
G2,1.0,1.256472,-0.73197
G2,2.0,0.660232,-0.350872
G2,3.0,-0.939433,-0.489337


As seen above, the index levels don't have names. We can set them by assigning a list of names to the `df.index.names` attribute.

In the example below, we name the level holding the **Gs** `Groups` and the level holding the **numbers** `Number`.

In [44]:
df.index.names = ['Groups', 'Number']
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
Groups,Number,Unnamed: 2_level_1,Unnamed: 3_level_1
G1,1,0.132708,-0.476142
G1,2,1.308473,0.195013
G1,3,0.40021,-0.337632
G2,1,1.256472,-0.73197
G2,2,0.660232,-0.350872
G2,3,-0.939433,-0.489337


In [50]:
df.loc['G2'].loc[2].loc['B']

-0.3508718914398713
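The chained `.loc` calls above can also be written as a single lookup by passing the row labels as a tuple. The sketch below uses a hypothetical frame named `demo` with fixed values (not the notebook's random data) so the two forms can be compared.

```python
import pandas as pd

# Hypothetical frame with the same index layout as the notebook's example.
index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                   names=['Groups', 'Number'])
demo = pd.DataFrame({'A': [10, 20, 30, 40, 50, 60],
                     'B': [1, 2, 3, 4, 5, 6]}, index=index)

chained = demo.loc['G2'].loc[2].loc['B']   # one level at a time
direct = demo.loc[('G2', 2), 'B']          # (outer, inner) tuple, then column
row = demo.loc[('G1', 2), :]               # whole row as a Series

print(chained, direct)
print(row)
```

The tuple form does the selection in one step, which avoids building the intermediate sub data frames.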

We can use the `df.xs()` method to get a cross-section of the data frame. It lets us select on an inner level of a multilevel index directly; the example below returns every row whose `Number` is 1.

In [52]:
df.xs(1, level="Number")

Unnamed: 0_level_0,A,B
Groups,Unnamed: 1_level_1,Unnamed: 2_level_1
G1,0.132708,-0.476142
G2,1.256472,-0.73197
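A few more ways to call `xs`, sketched on a hypothetical frame named `demo` with fixed values. The `drop_level=False` option keeps the selected level in the result instead of removing it.

```python
import pandas as pd

# Hypothetical frame with the same index layout as the notebook's example.
index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                   names=['Groups', 'Number'])
demo = pd.DataFrame({'A': [10, 20, 30, 40, 50, 60],
                     'B': [1, 2, 3, 4, 5, 6]}, index=index)

inner = demo.xs(1, level='Number')    # rows with Number == 1, indexed by Groups
outer = demo.xs('G1')                 # outer level: same rows as demo.loc['G1']
kept = demo.xs(1, level='Number', drop_level=False)  # keep both index levels

print(inner)
print(kept)
```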
