# pandas Data Structures:
We have learned about **Series**; now let's learn about the DataFrame (the 2<sup>nd</sup> workhorse of pandas), which builds on the concepts of Series.

* DataFrame
* Grab data (column wise)
* Grab data (row wise)
* Grabbing an element or a sub-set of the dataframe
* Adding new column
* Deleting the column
* boolean_mask
* boolean_mask(Combine 2 conditions)
* reset_index(), set_index(), head(), tail(), info(), describe()

## DataFrame
* A very simple way to think about a DataFrame is as "a bunch of Series put together so that they share the same index". <br> 
* A DataFrame is a rectangular table of data that contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). A DataFrame has both a row and a column index; it can be thought of as a dictionary of Series all sharing the same index. <br>
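
The "dictionary of Series sharing one index" picture can be sketched directly. This is a minimal illustration, not part of the original notebook; the names `labels`, `s1`, `s2` and `small_df` are made up for the example.

```python
import pandas as pd

# Two Series built on the same row labels (illustrative names and values)
labels = ['r1', 'r2', 'r3']
s1 = pd.Series([1, 2, 3], index=labels)
s2 = pd.Series(['x', 'y', 'z'], index=labels)

# A dict of Series becomes a DataFrame:
# dict keys -> column names, shared index -> row labels
small_df = pd.DataFrame({'numbers': s1, 'letters': s2})

print(small_df)
print(small_df.dtypes)  # each column keeps its own value type
```

Because both Series carry the same labels, pandas lines the values up row by row without any extra work.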

&#9758; *A good read for those who are interested: [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)<br>*

Let's learn **DataFrame with examples:**<br> 

In [1]:
import pandas as pd
import numpy as np

In [2]:
array_2d = np.arange(0, 100).reshape(10,10)

In [3]:
array_2d

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])

We started above with a simple example, using **`arange()`** and **`reshape()`** together to create a 2D array (matrix).<br>

Now let's create two sets of labels:
* for rows, 'r1' to 'r10'
* for columns, 'c1' to 'c10'

In [4]:
ind = 'r1 r2 r3 r4 r5 r6 r7 r8 r9 r10'.split()  #['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']
col = 'c1 c2 c3 c4 c5 c6 c7 c8 c9 c10'.split()

&#9989; *Use **TAB** for auto-complete and **shift + TAB**  for doc.*

In [5]:
# What do the index and column labels look like?
ind

['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']

In [6]:
col

['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10']

In [7]:
df = pd.DataFrame(data = array_2d, index=ind, columns=col)

In [9]:
df # select * from df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


**df** is our first dataframe. <br>
We have columns, c1 to c10, and their corresponding rows, r1 to r10. <br>
Each column is actually a pandas series, sharing a common index (row labels). <br>

&#9758; Let's learn how to **grab the data** that we need. This is the most important skill to learn before moving on!<br>

### Columns 

In [22]:
df[['c1', 'c10']]

Unnamed: 0,c1,c10
r1,0,9
r2,10,19
r3,20,29
r4,30,39
r5,40,49
r6,50,59
r7,60,69
r8,70,79
r9,80,89
r10,90,99


In [15]:
df[['c5', 'c2', 'c1']] #select c5, c2, c1 from df

Unnamed: 0,c5,c2,c1
r1,4,1,0
r2,14,11,10
r3,24,21,20
r4,34,31,30
r5,44,41,40
r6,54,51,50
r7,64,61,60
r8,74,71,70
r9,84,81,80
r10,94,91,90


In [None]:
type(df['c10'])

In [23]:
# Grabbing more than one column, pass the list of columns you need! 
df[['c1', 'c10', 'c3']] #select c1, c10, c3 from df 

Unnamed: 0,c1,c10,c3
r1,0,9,2
r2,10,19,12
r3,20,29,22
r4,30,39,32
r5,40,49,42
r6,50,59,52
r7,60,69,62
r8,70,79,72
r9,80,89,82
r10,90,99,92
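
The difference between single and double brackets is easy to miss, so here is a small sketch of it. It rebuilds the same 10x10 dataframe so that it runs on its own; `one_col` and `one_col_df` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 100).reshape(10, 10),
                  index=[f'r{i}' for i in range(1, 11)],
                  columns=[f'c{i}' for i in range(1, 11)])

one_col = df['c10']       # single brackets -> a Series
one_col_df = df[['c10']]  # a list inside the brackets -> a one-column DataFrame

print(type(one_col))
print(type(one_col_df))
print(one_col.index.equals(df.index))  # the column shares the dataframe's row labels
```

Passing a list always gives back a DataFrame, even when the list holds a single column name.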


**df.column_name (e.g. df.c1, df.c2, etc.)** can be used to grab a column as well. It's good to know, but I don't recommend it. <br> 
If you press "TAB" after `df.`, you will see lots of available methods; using bracket notation keeps column names from being confused with those methods.<br>
**Let's try this once**

In [26]:
#df.c5
df['c5']

r1      4
r2     14
r3     24
r4     34
r5     44
r6     54
r7     64
r8     74
r9     84
r10    94
Name: c5, dtype: int32
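
One concrete reason to prefer brackets: a column whose name matches an existing DataFrame method cannot be reached with dot notation. The sketch below uses a hypothetical column called `count` (chosen only because `DataFrame.count()` exists); `demo` is an illustrative name.

```python
import pandas as pd

# 'count' is a hypothetical column name that collides with the DataFrame.count() method
demo = pd.DataFrame({'count': [3, 5, 8], 'c1': [0, 10, 20]})

print(callable(demo.count))    # dot access finds the method, not the column
print(demo['count'].tolist())  # bracket access always returns the column
print(demo.c1.tolist())        # dot access works only when the name does not collide
```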

### Adding new column
Let's try it with the "+" operation!

In [27]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [28]:
df['new']= df['c1'] + df['c2']  # select *, (c1 + c2) as new from df

In [29]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,new
r1,0,1,2,3,4,5,6,7,8,9,1
r2,10,11,12,13,14,15,16,17,18,19,21
r3,20,21,22,23,24,25,26,27,28,29,41
r4,30,31,32,33,34,35,36,37,38,39,61
r5,40,41,42,43,44,45,46,47,48,49,81
r6,50,51,52,53,54,55,56,57,58,59,101
r7,60,61,62,63,64,65,66,67,68,69,121
r8,70,71,72,73,74,75,76,77,78,79,141
r9,80,81,82,83,84,85,86,87,88,89,161
r10,90,91,92,93,94,95,96,97,98,99,181


In [33]:
df['new'] = [1,2,3,4,5,6,7,8,9,10]
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,new,new2
r1,0,1,2,3,4,5,6,7,8,9,1,B
r2,10,11,12,13,14,15,16,17,18,19,2,B
r3,20,21,22,23,24,25,26,27,28,29,3,B
r4,30,31,32,33,34,35,36,37,38,39,4,B
r5,40,41,42,43,44,45,46,47,48,49,5,B
r6,50,51,52,53,54,55,56,57,58,59,6,B
r7,60,61,62,63,64,65,66,67,68,69,7,B
r8,70,71,72,73,74,75,76,77,78,79,8,B
r9,80,81,82,83,84,85,86,87,88,89,9,B
r10,90,91,92,93,94,95,96,97,98,99,10,B


In [31]:
df['new2'] = 'B'

In [34]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,new,new2
r1,0,1,2,3,4,5,6,7,8,9,1,B
r2,10,11,12,13,14,15,16,17,18,19,2,B
r3,20,21,22,23,24,25,26,27,28,29,3,B
r4,30,31,32,33,34,35,36,37,38,39,4,B
r5,40,41,42,43,44,45,46,47,48,49,5,B
r6,50,51,52,53,54,55,56,57,58,59,6,B
r7,60,61,62,63,64,65,66,67,68,69,7,B
r8,70,71,72,73,74,75,76,77,78,79,8,B
r9,80,81,82,83,84,85,86,87,88,89,9,B
r10,90,91,92,93,94,95,96,97,98,99,10,B


In [35]:
new_col = ['a', 'b','D','a', 'b','D','a', 'b','D','S']

df.insert(loc = 4, column = 'test', value = new_col)

df

Unnamed: 0,c1,c2,c3,c4,test,c5,c6,c7,c8,c9,c10,new,new2
r1,0,1,2,3,a,4,5,6,7,8,9,1,B
r2,10,11,12,13,b,14,15,16,17,18,19,2,B
r3,20,21,22,23,D,24,25,26,27,28,29,3,B
r4,30,31,32,33,a,34,35,36,37,38,39,4,B
r5,40,41,42,43,b,44,45,46,47,48,49,5,B
r6,50,51,52,53,D,54,55,56,57,58,59,6,B
r7,60,61,62,63,a,64,65,66,67,68,69,7,B
r8,70,71,72,73,b,74,75,76,77,78,79,8,B
r9,80,81,82,83,D,84,85,86,87,88,89,9,B
r10,90,91,92,93,S,94,95,96,97,98,99,10,B
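
The three ways of adding a column used above can be put side by side on a smaller frame. This is a sketch with made-up column names (`total`, `flag`, `test`), not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 12).reshape(3, 4),
                  index=['r1', 'r2', 'r3'],
                  columns=['c1', 'c2', 'c3', 'c4'])

df['total'] = df['c1'] + df['c2']   # element-wise sum, aligned on the row index
df['flag'] = 'B'                    # a single value is repeated for every row
df.insert(loc=1, column='test', value=['a', 'b', 'D'])  # loc is a 0-based column position

print(df.columns.tolist())
print(df['total'].tolist())
```

Plain assignment appends the new column at the end, while `insert()` places it at the requested position. A list passed as the value must have one entry per row.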


### Deleting the column -- `drop()`

        *df.drop('new')-- ValueError: labels ['new'] not contained in axis

Press Shift+TAB and you will see that the default axis is 0, which refers to the index (row labels); for a column, we need to specify axis = 1.<br>
&#9758; rows refer to axis 0 and columns refer to axis 1<br> 
&#9758; Quick Check: *df.shape gives a tuple (rows, cols), with the row count at [0] and the column count at [1]*
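
A short sketch of how the axis numbers line up with `df.shape`, using a small 3x4 frame so the counts are easy to follow. The variable names are illustrative; the `columns=` keyword is an alternative spelling that `drop()` also accepts.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 12).reshape(3, 4),
                  index=['r1', 'r2', 'r3'],
                  columns=['c1', 'c2', 'c3', 'c4'])

n_rows, n_cols = df.shape               # position 0 -> rows, position 1 -> columns

no_r1 = df.drop('r1', axis=0)           # axis=0: look for the label among the rows
no_c1 = df.drop('c1', axis=1)           # axis=1: look for the label among the columns
same_as_no_c1 = df.drop(columns='c1')   # same result without remembering the axis number

print(n_rows, n_cols)
print(no_r1.shape, no_c1.shape)
```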

In [36]:
df.drop('r1', axis=0)

Unnamed: 0,c1,c2,c3,c4,test,c5,c6,c7,c8,c9,c10,new,new2
r2,10,11,12,13,b,14,15,16,17,18,19,2,B
r3,20,21,22,23,D,24,25,26,27,28,29,3,B
r4,30,31,32,33,a,34,35,36,37,38,39,4,B
r5,40,41,42,43,b,44,45,46,47,48,49,5,B
r6,50,51,52,53,D,54,55,56,57,58,59,6,B
r7,60,61,62,63,a,64,65,66,67,68,69,7,B
r8,70,71,72,73,b,74,75,76,77,78,79,8,B
r9,80,81,82,83,D,84,85,86,87,88,89,9,B
r10,90,91,92,93,S,94,95,96,97,98,99,10,B


In [37]:
df

Unnamed: 0,c1,c2,c3,c4,test,c5,c6,c7,c8,c9,c10,new,new2
r1,0,1,2,3,a,4,5,6,7,8,9,1,B
r2,10,11,12,13,b,14,15,16,17,18,19,2,B
r3,20,21,22,23,D,24,25,26,27,28,29,3,B
r4,30,31,32,33,a,34,35,36,37,38,39,4,B
r5,40,41,42,43,b,44,45,46,47,48,49,5,B
r6,50,51,52,53,D,54,55,56,57,58,59,6,B
r7,60,61,62,63,a,64,65,66,67,68,69,7,B
r8,70,71,72,73,b,74,75,76,77,78,79,8,B
r9,80,81,82,83,D,84,85,86,87,88,89,9,B
r10,90,91,92,93,S,94,95,96,97,98,99,10,B


In [None]:
df.drop('C33', axis=1)  # 'C33' is not a column label in df, so this call raises an error

In [None]:
df

In [42]:
df.drop('test', axis=1, inplace=True)

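&#9758; Is the column really deleted? <br>
Without `inplace=True`, `drop()` returns a new dataframe and leaves `df` unchanged, as the `df.drop('r1', axis=0)` call above showed. With `inplace=True`, as used here, `df` itself is modified. Output df to check:<br>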

In [43]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [44]:
df.loc['R11'] = [90,91,92,93,94,95,96,97,98,99]

In [45]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [46]:
df.drop('R11', axis=0, inplace=True)

In [47]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


To delete a column or a row from `df` itself, you have to tell pandas by setting<br>
* ***inplace = True*** (default is inplace=False).<br>

&#9989; *pandas is cautious: by default `drop()` returns a new dataframe and leaves the original untouched, so we do not lose information by mistake*
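
The effect of `inplace` can be checked on a small frame. This is a sketch with illustrative names (`returned`, `outcome`); it also shows the common alternative of rebinding the name to the returned dataframe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 12).reshape(3, 4),
                  index=['r1', 'r2', 'r3'],
                  columns=['c1', 'c2', 'c3', 'c4'])

returned = df.drop('c1', axis=1)   # default inplace=False: a new dataframe comes back
print('c1' in df.columns)          # df itself still has c1

df = df.drop('c1', axis=1)         # option 1: rebind the name to the returned dataframe
outcome = df.drop('c2', axis=1, inplace=True)  # option 2: modify df directly, returns None

print(df.columns.tolist(), outcome)
```

Both options end with the column gone from `df`; with `inplace=True` there is nothing useful to assign, because the call returns `None`.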

### Rows
We can retrieve a row by its name or position with **[`loc`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)** and **[`iloc`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html)**.<br>
**loc** -- Access a group of rows and columns by label(s)

In [None]:
df.loc[['r2','r3', 'r5']]

**iloc** -- Access rows by integer position, even when the index is labeled. The next two cells return the same row, first by label and then by position.

In [None]:
df.loc[['r2']]

In [None]:
df.iloc[[1]] # iloc[position], integer (0-based) location

In [None]:
# more than one row -- pass a list of row labels!
df.loc[['r1','r2', 'r3']]  
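
To make the label versus position distinction concrete, the sketch below compares `loc` and `iloc` on the same 10x10 data. It is self-contained and uses illustrative variable names. Note one difference worth remembering: a label slice includes its end point, while a position slice does not.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 100).reshape(10, 10),
                  index=[f'r{i}' for i in range(1, 11)],
                  columns=[f'c{i}' for i in range(1, 11)])

by_label = df.loc[['r2']]        # row selected by its label
by_position = df.iloc[[1]]       # the same row selected by its 0-based position
print(by_label.equals(by_position))

label_slice = df.loc['r1':'r3']  # label slice: 'r3' is included
position_slice = df.iloc[0:3]    # position slice: 3 is excluded, rows 0, 1, 2
print(len(label_slice), len(position_slice))

row_as_series = df.loc['r2']     # a single label (no list) returns a Series
print(type(row_as_series))
```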

### Grabbing an element or a sub-set of the dataframe

In [48]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [49]:
df.iloc[0:3, 0:7]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7
r1,0,1,2,3,4,5,6
r2,10,11,12,13,14,15,16
r3,20,21,22,23,24,25,26


In [50]:
df.loc['r1','c1']

0

In [51]:
df.loc[['r1','r2'],['c1','c2']]

Unnamed: 0,c1,c2
r1,0,1
r2,10,11


In [53]:
# another example - arbitrary rows and columns, each passed as a list
df.loc[['r2','r5'],['c3','c4']]

Unnamed: 0,c3,c4
r2,12,13
r5,42,43


In [52]:
df  

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [54]:
# We can do a conditional selection as well
df > 5   

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,False,False,False,False,False,False,True,True,True,True
r2,True,True,True,True,True,True,True,True,True,True
r3,True,True,True,True,True,True,True,True,True,True
r4,True,True,True,True,True,True,True,True,True,True
r5,True,True,True,True,True,True,True,True,True,True
r6,True,True,True,True,True,True,True,True,True,True
r7,True,True,True,True,True,True,True,True,True,True
r8,True,True,True,True,True,True,True,True,True,True
r9,True,True,True,True,True,True,True,True,True,True
r10,True,True,True,True,True,True,True,True,True,True


In [55]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [57]:
bool_mask = df % 3 == 0
#bool_mask
df[bool_mask]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0.0,,,3.0,,,6.0,,,9.0
r2,,,12.0,,,15.0,,,18.0,
r3,,21.0,,,24.0,,,27.0,,
r4,30.0,,,33.0,,,36.0,,,39.0
r5,,,42.0,,,45.0,,,48.0,
r6,,51.0,,,54.0,,,57.0,,
r7,60.0,,,63.0,,,66.0,,,69.0
r8,,,72.0,,,75.0,,,78.0,
r9,,81.0,,,84.0,,,87.0,,
r10,90.0,,,93.0,,,96.0,,,99.0
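
Masking the whole dataframe keeps its shape, so every cell that fails the condition becomes NaN, and the integer columns turn into floats to make room for NaN. The sketch below checks this on the same data; `masked` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 100).reshape(10, 10),
                  index=[f'r{i}' for i in range(1, 11)],
                  columns=[f'c{i}' for i in range(1, 11)])

bool_mask = df % 3 == 0     # True where the value is a multiple of 3
masked = df[bool_mask]      # same shape as df; False cells become NaN

print(masked.shape)
print(masked.dtypes.unique())       # integers were converted to floats
print(masked.notna().sum().sum())   # how many values survived the mask
```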


&#9758; It's not common to apply such an operation to the entire dataframe. We usually apply it to a column or a row instead.<br>
**For example, we may not want rows with NaN values in the result.**<br>
What to do?<br>
Let's have a look at one example.

In [58]:
df  # the full dataframe, before filtering on c1 > 11

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


Let's apply a condition on column c1, say `c1 > 11`.<br>
Based on this conditional selection, the output will be:

In [59]:
df[df['c1']>11]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [60]:
df.query('c1 > 11')

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [63]:
df.query('c1 > 11')[['c1', 'c2']]

Unnamed: 0,c1,c2
r3,20,21
r4,30,31
r5,40,41
r6,50,51
r7,60,61
r8,70,71
r9,80,81
r10,90,91


In [62]:
#select c1, c2 from df where c1 > 11
df[df['c1'] > 11][['c1', 'c2']]

Unnamed: 0,c1,c2
r3,20,21
r4,30,31
r5,40,41
r6,50,51
r7,60,61
r8,70,71
r9,80,81
r10,90,91
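
The two spellings above give the same result, and `loc` offers a third that filters rows and picks columns in one step. A small self-contained check, with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 100).reshape(10, 10),
                  index=[f'r{i}' for i in range(1, 11)],
                  columns=[f'c{i}' for i in range(1, 11)])

mask = df['c1'] > 11                             # a boolean Series, one value per row

via_mask = df[mask][['c1', 'c2']]                # filter rows, then pick columns
via_query = df.query('c1 > 11')[['c1', 'c2']]    # the same condition written as a string
via_loc = df.loc[mask, ['c1', 'c2']]             # rows and columns in a single indexing step

print(via_mask.equals(via_query), via_mask.equals(via_loc))
```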


We don't want `r1` and `r2`, because they do not satisfy the condition (a whole-dataframe mask would show them as NaN). <br>
Let's filter the rows based on a condition on column values.

In [64]:
df[df['c1']>11][['c3','c10']]

Unnamed: 0,c3,c10
r3,22,29
r4,32,39
r5,42,49
r6,52,59
r7,62,69
r8,72,79
r9,82,89
r10,92,99


&#9758; The expression above, **"`df[df['c1']>11]`"**, is itself a dataframe with the condition applied, so we can select any column from it.<br> For example:

In [65]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [66]:
result = df[(df['c1']>11) & (df['c1']<80)]
result

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79


We can do the above operations (filtering and selecting columns) in a single line by stacking commands. 


In [67]:
df[df['c1']>11] # select * from df where c1 > 11

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [70]:
df[df['c1']>11].loc[['r3','r4']]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39


In [71]:
result = df[df['c1']==70] # select * from df where c1 = 70
result

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r8,70,71,72,73,74,75,76,77,78,79


### Combine 2 conditions 
Let's try on c1 for a value > 60 and on c2 for a value > 80

In [72]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [73]:
df[(df['c1']>60) & (df['c2']>80)] 

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [74]:
df[(df['c1']>60) | (df['c2']>80)] 

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


&#9989;**NOTE:**<br>
The "and" operator will not work in the above condition; using "and" will raise <br>

        *ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

"Ambiguous" here means that Python's "and" works only on single booleans, such as "True and False", whereas each condition here is a whole Series of True/False values. We need to use "&" instead ("|" for or), which compares the Series element by element.<br>
Try the above code using "and". <br>
The "and" operator cannot reduce a Series of True/False values to a single boolean, so it raises the error.
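
The behaviour described in the note can be reproduced safely by catching the error. This sketch rebuilds the same dataframe; `message`, `both` and `either` are illustrative names. The parentheses around each condition matter, because `&` and `|` bind more tightly than the comparison operators.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 100).reshape(10, 10),
                  index=[f'r{i}' for i in range(1, 11)],
                  columns=[f'c{i}' for i in range(1, 11)])

# Python's "and" needs one True/False value, but each side here is a whole Series
try:
    df[(df['c1'] > 60) and (df['c2'] > 80)]
except ValueError as err:
    message = str(err)
    print(message)

both = df[(df['c1'] > 60) & (df['c2'] > 80)]    # element-wise AND
either = df[(df['c1'] > 60) | (df['c2'] > 80)]  # element-wise OR

print(both.index.tolist())
print(either.index.tolist())
```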

### Let's have a quick look at a couple of useful methods.
***We will explore more later on in the course!***

**`reset_index()`** and **`set_index()`**<br>
We can reset the index of our dataframe to the default numerical index; pass `inplace = True` to make the change permanent. *The existing index becomes a new column.*

In [75]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [76]:
df.reset_index(inplace = True)

In [77]:
df

Unnamed: 0,index,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
0,r1,0,1,2,3,4,5,6,7,8,9
1,r2,10,11,12,13,14,15,16,17,18,19
2,r3,20,21,22,23,24,25,26,27,28,29
3,r4,30,31,32,33,34,35,36,37,38,39
4,r5,40,41,42,43,44,45,46,47,48,49
5,r6,50,51,52,53,54,55,56,57,58,59
6,r7,60,61,62,63,64,65,66,67,68,69
7,r8,70,71,72,73,74,75,76,77,78,79
8,r9,80,81,82,83,84,85,86,87,88,89
9,r10,90,91,92,93,94,95,96,97,98,99


In [78]:
df.set_index('c2', inplace = True)
df

c2,index,c1,c3,c4,c5,c6,c7,c8,c9,c10
1,r1,0,2,3,4,5,6,7,8,9
11,r2,10,12,13,14,15,16,17,18,19
21,r3,20,22,23,24,25,26,27,28,29
31,r4,30,32,33,34,35,36,37,38,39
41,r5,40,42,43,44,45,46,47,48,49
51,r6,50,52,53,54,55,56,57,58,59
61,r7,60,62,63,64,65,66,67,68,69
71,r8,70,72,73,74,75,76,77,78,79
81,r9,80,82,83,84,85,86,87,88,89
91,r10,90,92,93,94,95,96,97,98,99


**Consider this: we have a column in our data that could be a useful index,<br>
and we want to set that column as the index!**<br>

In [None]:
array_2d

In [None]:
col

In [80]:
df = pd.DataFrame(data = array_2d, index = ind, columns = col)
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [None]:
abc = 'a b c d e f g h i j'.split() # split at white spaces
df['newind']=abc
df

In [None]:
# set newind as the index; inplace=True modifies df itself
df.set_index('newind', inplace = True)

In [None]:
df
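
Both methods also work without `inplace`, returning a new dataframe each time. The sketch below does a round trip on a small frame; the names `indexed`, `restored` and `renumbered` are illustrative, and `drop=True` is the `reset_index()` option that discards the old index instead of keeping it as a column.

```python
import pandas as pd

df = pd.DataFrame({'c1': [0, 10, 20], 'c2': [1, 11, 21]},
                  index=['r1', 'r2', 'r3'])
df['newind'] = 'a b c'.split()

indexed = df.set_index('newind')             # 'newind' moves from the columns to the index
restored = indexed.reset_index()             # the index moves back and becomes a column again
renumbered = indexed.reset_index(drop=True)  # the old index is thrown away

print(indexed.index.tolist())
print(restored.columns.tolist())
print(renumbered.columns.tolist())
```

Note that `set_index()` replaces the previous row labels, so 'r1' to 'r3' are no longer present in `indexed`.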

### `head()`, `tail()`

In [90]:
# Returns first n rows
df.head() # n = 5 by default 

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49


In [92]:
# Returns last n rows
df.tail(2) # n = 5 by default

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


### `info()`
Provides a concise summary of the DataFrame.

In [93]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [94]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, r1 to r10
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   c1      10 non-null     int32
 1   c2      10 non-null     int32
 2   c3      10 non-null     int32
 3   c4      10 non-null     int32
 4   c5      10 non-null     int32
 5   c6      10 non-null     int32
 6   c7      10 non-null     int32
 7   c8      10 non-null     int32
 8   c9      10 non-null     int32
 9   c10     10 non-null     int32
dtypes: int32(10)
memory usage: 480.0+ bytes


### `describe()`
Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding `NaN` values.

In [95]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [98]:
df.describe()

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,45.0,46.0,47.0,48.0,49.0,50.0,51.0,52.0,53.0,54.0
std,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504
min,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
25%,22.5,23.5,24.5,25.5,26.5,27.5,28.5,29.5,30.5,31.5
50%,45.0,46.0,47.0,48.0,49.0,50.0,51.0,52.0,53.0,54.0
75%,67.5,68.5,69.5,70.5,71.5,72.5,73.5,74.5,75.5,76.5
max,90.0,91.0,92.0,93.0,94.0,95.0,96.0,97.0,98.0,99.0
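
The numbers in this table can be reproduced by hand for one column. The sketch below recomputes the `c1` entries; it relies on pandas computing the sample standard deviation (dividing by n - 1) and interpolating linearly between values for the percentiles, which are its defaults.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 100).reshape(10, 10),
                  index=[f'r{i}' for i in range(1, 11)],
                  columns=[f'c{i}' for i in range(1, 11)])

summary = df.describe()
c1 = df['c1']

# Sample standard deviation: squared deviations from the mean, divided by (n - 1)
manual_std = np.sqrt(((c1 - c1.mean()) ** 2).sum() / (len(c1) - 1))

print(summary.loc['std', 'c1'], manual_std)
print(summary.loc['25%', 'c1'], c1.quantile(0.25))
```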


# Excellent! 
Congratulations, you are making great progress. Keep it up!

In [99]:
df2 = pd.read_csv(r'E:\Breast_Cancer_Diagnostic.csv')  # raw string: the backslash is kept as-is

In [100]:
df2

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,diagnosis
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,M
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,M
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,M
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,M
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,M
...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,M
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,M
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,M
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,M


## nrows

In [None]:
df2 = pd.read_csv(r'E:\Breast_Cancer_Diagnostic.csv', nrows=100)  # read only the first 100 data rows
df2
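
If the CSV file is not at hand, the effect of `nrows` can still be tried on an in-memory stand-in. In the sketch below, `csv_text` is a tiny made-up sample with two of the columns shown above, and `io.StringIO` lets `read_csv` treat the string like a file.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for the CSV file (illustrative, four data rows)
csv_text = (
    "radius_mean,diagnosis\n"
    "17.99,M\n"
    "20.57,M\n"
    "19.69,M\n"
    "11.42,M\n"
)

full = pd.read_csv(io.StringIO(csv_text))                # reads every row
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)  # stops after two data rows

print(len(full), len(first_two))
print(first_two)
```

The header line does not count towards `nrows`; only data rows do. This makes `nrows` handy for peeking at a large file before loading all of it.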

In [None]:
df2.info()

In [None]:
df2.info(verbose=False)

In [None]:
df2.describe(include = 'all')

In [None]:
df2

In [None]:
pd.set_option('display.max_rows', None)

In [None]:
df2

In [None]:
df2.describe()