# Pandas

## Understanding Index and Columns

#### Changing the column names

In [49]:
df.columns

Index(['W', 'X', 'Y', 'Z'], dtype='object')

In [50]:
df.columns = 'S T U V'.split()
df

Unnamed: 0,S,T,U,V
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [51]:
# Reset the column labels to the default integer positions 0..n-1

df.columns = list(range(len(df.columns)))
df

Unnamed: 0,0,1,2,3
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [52]:
df.columns = ['s', 's', 'u', 'u']
df

Unnamed: 0,s,s,u,u
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [54]:
df['s']

# When two columns share the same name, selecting that name returns a DataFrame containing both columns

Unnamed: 0,s,s
A,2.70685,0.628133
B,0.651118,-0.319318
C,-2.018168,0.740122
D,0.188695,-0.758872
E,0.190794,1.978757
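
Duplicate column labels are easy to create by accident, so it can help to detect them. The sketch below uses a small stand-in frame named `demo` with made-up values (not the notebook's data) and shows one possible way to find repeated labels and keep only the first column for each.

```python
import pandas as pd

# Stand-in frame with made-up values and deliberately duplicated column labels
demo = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["s", "s", "u", "u"])

selected = demo["s"]                    # both columns labelled 's'
print(type(selected).__name__)          # a DataFrame, not a Series

dupes = demo.columns.duplicated()       # True for every repeat of an earlier label
print(dupes.tolist())

unique_demo = demo.loc[:, ~dupes]       # keep the first column for each label
print(list(unique_demo.columns))
```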


#### Changing Index Names

This works the same way as changing column names: assign a new list of labels to df.index.

In [56]:
df.columns = 'W X Y Z'.split()
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [75]:
df.index = 'F G H I J'.split()
df

Unnamed: 0,W,X,Y,Z
F,2.70685,0.628133,0.907969,0.503826
G,0.651118,-0.319318,-0.848077,0.605965
H,-2.018168,0.740122,0.528813,-0.589001
I,0.188695,-0.758872,-0.933237,0.955057
J,0.190794,1.978757,2.605967,0.683509


In [60]:
df.index = 'A B C D E'.split()
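
Assigning to df.columns or df.index replaces every label at once and modifies the frame in place. When only some labels need to change, `rename` with a mapping is an alternative. The following is a minimal sketch on a stand-in frame named `demo` with made-up values.

```python
import pandas as pd

# Stand-in frame with made-up values
demo = pd.DataFrame({"W": [1.0, 2.0], "X": [3.0, 4.0]}, index=["A", "B"])

# rename returns a new DataFrame; labels missing from the mapping are kept as they are
renamed = demo.rename(columns={"W": "S"}, index={"A": "F"})

print(list(renamed.columns))
print(list(renamed.index))
print(list(demo.columns))   # the original frame is unchanged
```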

## Selection and Indexing

Let's look at the various ways to grab data from a DataFrame.

With square brackets, a single label selects a column.

In [80]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [86]:
df[1:3]
# Slicing with integers inside square brackets selects rows by position

Unnamed: 0,W,X,Y,Z
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001


In [87]:
df['A':'C']
# A slice inside square brackets always selects rows;
# the end is included for label slices but excluded for integer slices

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001


In [82]:
df['W':'Y']
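# The slice is applied to the row index, not the columns. No row labels fall between 'W' and 'Y',
# so the result is empty. To slice columns by label, use df.loc[:, 'W':'Y'] instead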

Unnamed: 0,W,X,Y,Z
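
The empty result comes from the slice being looked up in the row index. As a sketch, the frame is rebuilt here from the values printed in this notebook, and the row slice is compared with a column slice through loc.

```python
import pandas as pd

# Rebuilt from the values printed in this notebook
df = pd.DataFrame(
    {
        "W": [2.706850, 0.651118, -2.018168, 0.188695, 0.190794],
        "X": [0.628133, -0.319318, 0.740122, -0.758872, 1.978757],
        "Y": [0.907969, -0.848077, 0.528813, -0.933237, 2.605967],
        "Z": [0.503826, 0.605965, -0.589001, 0.955057, 0.683509],
    },
    index=list("ABCDE"),
)

row_slice = df["W":"Y"]        # looked up in the row index A..E, so nothing matches
col_slice = df.loc[:, "W":"Y"] # all rows, columns W through Y (end label included)

print(row_slice.empty)
print(list(col_slice.columns))
```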


In [62]:
df['W']['C']

-2.018168244037392

In [64]:
df['W'][2]
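# Referencing by position or by index label both work with square brackets on a Series.
# The integer-position fallback on a labeled Series is deprecated in recent pandas versions, so prefer df['W'].iloc[2]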

-2.018168244037392

In [84]:
# df[0] # this raises a KeyError: with these column labels, a column must be referenced by its label
# To select by integer position, use iloc
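
As a sketch of the iloc alternative mentioned above, the frame is rebuilt from the values printed in this notebook and a column is picked by position rather than by label.

```python
import pandas as pd

# Rebuilt from the values printed in this notebook
df = pd.DataFrame(
    {
        "W": [2.706850, 0.651118, -2.018168, 0.188695, 0.190794],
        "X": [0.628133, -0.319318, 0.740122, -0.758872, 1.978757],
        "Y": [0.907969, -0.848077, 0.528813, -0.933237, 2.605967],
        "Z": [0.503826, 0.605965, -0.589001, 0.955057, 0.683509],
    },
    index=list("ABCDE"),
)

first_col = df.iloc[:, 0]   # all rows, first column by position (a Series named 'W')
value = df.iloc[2, 0]       # third row, first column

print(first_col.name)
print(value)
```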

In [85]:
df[['W', 'Z']]
# To select more than one column, pass a list of column names

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


In [33]:
# Attribute (dot) access (NOT RECOMMENDED: it fails when a column name clashes with a DataFrame method or attribute)
df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
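
To illustrate why dot access is discouraged, here is a small sketch on a stand-in frame named `demo` with made-up values. The column names `count` and `my col` are chosen only to trigger the two common failure modes.

```python
import pandas as pd

# Stand-in frame with made-up values; 'count' clashes with DataFrame.count
demo = pd.DataFrame({"W": [1, 2], "count": [10, 20], "my col": [5, 6]})

print(demo.W.tolist())          # works: 'W' is a valid identifier with no clash
print(callable(demo.count))     # demo.count is the DataFrame method, not the column
print(demo["count"].tolist())   # square brackets always reach the column
print(demo["my col"].tolist())  # a label with a space cannot be written as an attribute
```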

DataFrame columns are just Series.
Each DataFrame row can likewise be retrieved as a Series.

In [34]:
type(df['W'])

pandas.core.series.Series

In [35]:
type(df)

pandas.core.frame.DataFrame

#### Selecting with the pandas loc and iloc indexers

These indexers select rows first.

In [36]:
df.loc['A']

# loc selects data by row label
# loc is an indexer, not a method, so it takes square brackets rather than parentheses

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

In [37]:
df.loc['A':'D']

# Slices also work with loc
# Unlike ordinary Python slicing, the end label is included

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


In [38]:
df.iloc[0]

# iloc selects rows by integer position

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

In [88]:
df.iloc[0:4]

# Unlike loc, iloc excludes the end position, as ordinary Python slicing does

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


In [100]:
df.iloc[0:3].iloc[1:3]

# loc/iloc can be chained any number of times, because each call returns a Series or DataFrame

Unnamed: 0,W,X,Y,Z
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
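
The difference between inclusive label slices and exclusive position slices matters most when the index itself holds integers. The sketch below first repeats the comparison on a one-column frame with the W values from this notebook, then uses a made-up frame named `nums` with an integer index.

```python
import pandas as pd

# One column of the notebook's values is enough to compare the two indexers
df = pd.DataFrame(
    {"W": [2.706850, 0.651118, -2.018168, 0.188695, 0.190794]},
    index=list("ABCDE"),
)

by_label = df.loc["A":"D"]     # end label 'D' is included: 4 rows
by_position = df.iloc[0:4]     # end position 4 is excluded: rows 0 to 3
print(by_label.equals(by_position))

# Made-up frame whose index labels are integers starting at 1
nums = pd.DataFrame({"v": [10, 20, 30]}, index=[1, 2, 3])
print(nums.loc[1:2]["v"].tolist())    # labels 1 and 2
print(nums.iloc[1:2]["v"].tolist())   # position 1 only
```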


#### Selecting subset of rows and columns

In [40]:
df.loc['B','Y']

# Row B, column Y
# A comma separates the row and column selectors, so nested square brackets are not needed
# Nested brackets such as df.loc['B']['Y'] also work, but the second bracket should use the label;
# when the first selection is a DataFrame, an integer there raises a KeyError

-0.8480769834036315

In [41]:
df.loc[['A','B'],['W','Y']]

Unnamed: 0,W,Y
A,2.70685,0.907969
B,0.651118,-0.848077


In [95]:
df.loc['A':'B','W':'Y']

Unnamed: 0,W,X,Y
A,2.70685,0.628133,0.907969
B,0.651118,-0.319318,-0.848077


In [97]:
df.iloc[0:3].iloc[1:2,1]

B   -0.319318
Name: X, dtype: float64

In [99]:
df.iloc[0:3].iloc[1:3]['X'] # An integer in the last bracket would raise a KeyError, because plain [] on a DataFrame looks up column labels

B   -0.319318
C    0.740122
Name: X, dtype: float64
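
The chained selections above can usually be written as a single loc or iloc call. This is a sketch of that equivalence, with the frame rebuilt from the values printed in this notebook.

```python
import pandas as pd

# Rebuilt from the values printed in this notebook
df = pd.DataFrame(
    {
        "W": [2.706850, 0.651118, -2.018168, 0.188695, 0.190794],
        "X": [0.628133, -0.319318, 0.740122, -0.758872, 1.978757],
        "Y": [0.907969, -0.848077, 0.528813, -0.933237, 2.605967],
        "Z": [0.503826, 0.605965, -0.589001, 0.955057, 0.683509],
    },
    index=list("ABCDE"),
)

chained = df.iloc[0:3].iloc[1:3]["X"]               # the chained form used above
by_position = df.iloc[1:3, df.columns.get_loc("X")]  # one iloc call: rows 1-2, column position of 'X'
by_label = df.loc["B":"C", "X"]                     # one loc call: rows B to C, column 'X'

print(chained.equals(by_position), chained.equals(by_label))
```

A single call avoids building the intermediate frames and keeps the row and column selectors side by side.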

### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

In [43]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [44]:
df>0

# The output is a boolean DataFrame 

Unnamed: 0,W,X,Y,Z
A,True,True,True,True
B,True,False,False,True
C,False,True,True,False
D,True,False,False,True
E,True,True,True,True


In [45]:
df[df>0]

# Values at positions where the condition is False are replaced with NaN

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509
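
Masking with a boolean DataFrame gives the same result as the `where` method, which additionally accepts a replacement value. A minimal sketch, with the frame rebuilt from the values printed in this notebook:

```python
import pandas as pd

# Rebuilt from the values printed in this notebook
df = pd.DataFrame(
    {
        "W": [2.706850, 0.651118, -2.018168, 0.188695, 0.190794],
        "X": [0.628133, -0.319318, 0.740122, -0.758872, 1.978757],
        "Y": [0.907969, -0.848077, 0.528813, -0.933237, 2.605967],
        "Z": [0.503826, 0.605965, -0.589001, 0.955057, 0.683509],
    },
    index=list("ABCDE"),
)

masked = df[df > 0]              # NaN wherever the condition is False
same = df.where(df > 0)          # equivalent spelling
filled = df.where(df > 0, 0.0)   # use 0.0 instead of NaN for the False positions

print(masked.equals(same))
print(int(masked.isna().sum().sum()))   # number of masked cells
```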


In [46]:
df[df['W']>0]

# When the condition is on a Series, the result is a DataFrame of all rows where the condition is True for that Series

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [47]:
df[df['W']>0]['Y']

A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64

In [48]:
df[df['W']>0][['Y','X']]

Unnamed: 0,Y,X
A,0.907969,0.628133
B,-0.848077,-0.319318
D,-0.933237,-0.758872
E,2.605967,1.978757


For two conditions, use | (or) and & (and), with each condition wrapped in parentheses. Python's and/or keywords do not work element-wise on a Series:

In [101]:
df[(df['W']>0) & (df['Y'] > 1)]

Unnamed: 0,W,X,Y,Z
E,0.190794,1.978757,2.605967,0.683509
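
The other element-wise operators follow the same pattern. The sketch below, on a frame rebuilt from the W and Y values printed in this notebook, shows | and ~ and what happens when the plain `and` keyword is used instead.

```python
import pandas as pd

# Rebuilt from the W and Y values printed in this notebook
df = pd.DataFrame(
    {
        "W": [2.706850, 0.651118, -2.018168, 0.188695, 0.190794],
        "Y": [0.907969, -0.848077, 0.528813, -0.933237, 2.605967],
    },
    index=list("ABCDE"),
)

either = df[(df["W"] > 0) | (df["Y"] > 1)]   # rows where at least one condition holds
negated = df[~(df["W"] > 0)]                 # rows where W > 0 is False

print(list(either.index))
print(list(negated.index))

# The keyword 'and' asks for a single truth value of a whole Series, which pandas refuses
try:
    df[(df["W"] > 0) and (df["Y"] > 1)]
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```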


In [None]:
df[df>0][df[df>0]['W']>0]

# A second condition can be applied to the already-masked DataFrame,
# which works like a nested conditional
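
The one-liner above is easier to follow when the intermediate result gets a name. This is a sketch of the same two steps, with the frame rebuilt from the values printed in this notebook; the variable names `positive` and `result` are illustrative.

```python
import pandas as pd

# Rebuilt from the values printed in this notebook
df = pd.DataFrame(
    {
        "W": [2.706850, 0.651118, -2.018168, 0.188695, 0.190794],
        "X": [0.628133, -0.319318, 0.740122, -0.758872, 1.978757],
        "Y": [0.907969, -0.848077, 0.528813, -0.933237, 2.605967],
        "Z": [0.503826, 0.605965, -0.589001, 0.955057, 0.683509],
    },
    index=list("ABCDE"),
)

positive = df[df > 0]                   # step 1: mask non-positive values with NaN
result = positive[positive["W"] > 0]    # step 2: keep rows whose masked W is still > 0

one_liner = df[df > 0][df[df > 0]["W"] > 0]
print(result.equals(one_liner))
print(list(result.index))               # row C drops out because its W became NaN
```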

In [None]:
# isin can also build the condition, e.g. for a separate DataFrame named reviews (not defined above) with a country column:
# reviews.loc[reviews.country.isin(['Italy', 'France'])]
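
Since `reviews` is not defined in this notebook, the sketch below builds a tiny stand-in with made-up rows to show how the isin filter behaves. Only the `country` column name comes from the comment above; the `points` column and all values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the reviews frame; rows and the 'points' column are made up
reviews = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "France", "US"],
        "points": [87, 87, 90, 85],
    }
)

mask = reviews["country"].isin(["Italy", "France"])   # True where country is in the list
selected = reviews.loc[mask]

print(mask.tolist())
print(selected["country"].tolist())
```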