# pandas Data Structures:
We have learned about **Series**; let's now learn DataFrames (the 2<sup>nd</sup> workhorse of pandas) to build on our concepts of Series.

* DataFrame
* Grab data (column wise)
* Grab data (row wise)
* Grabbing an element or a sub-set of the dataframe
* Adding new column
* Deleting the column
* boolean_mask
* boolean_mask(Combine 2 conditions)
* reset_index(), set_index(), head(), tail(), info(), describe()

## DataFrame
* A very simple way to think about the DataFrame is, "a bunch of Series put together such that they share the same index". <br>
* A DataFrame is a rectangular table of data that contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). A DataFrame has both a row and a column index; it can be thought of as a dictionary of Series all sharing the same index. <br>
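
The "dictionary of Series sharing the same index" idea can be sketched directly. The names `s1`, `s2` and `frame`, and the values in them, are made up for illustration and are not part of the tutorial data.

```python
import pandas as pd

# Two Series that share the same index labels (hypothetical example data)
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series(['x', 'y', 'z'], index=['a', 'b', 'c'])

# A dict of Series becomes a DataFrame:
# keys -> column names, shared index -> row labels
frame = pd.DataFrame({'num': s1, 'txt': s2})

print(frame)
print(frame.dtypes)        # each column keeps its own value type
print(type(frame['num']))  # a single column is a Series again
```

Each column can hold a different type because each one is stored as its own Series.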

&#9758; *A good read for those who are interested: [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)<br>*

Let's learn **DataFrame with examples:**<br>

In [1]:
import pandas as pd
import numpy as np

In [2]:
array_2d = np.arange(0, 100).reshape(10,10)

In [3]:
array_2d

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])

Let's create two labels/indexes:
* for rows 'r1 to r10'
* for columns 'c1 to c10'

Above, we used **`arange()`** and **`reshape()`** together to create a 2D array (matrix). Now we build the row and column labels with **`split()`**.<br>

In [4]:
ind = 'r1 r2 r3 r4 r5 r6 r7 r8 r9 r10'.split()  #['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']
col = 'c1 c2 c3 c4 c5 c6 c7 c8 c9 c10'.split()

&#9989; *Use **TAB** for auto-complete and **shift + TAB**  for doc.*

In [5]:
# What the index, columns and array_2d look like
ind

['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']

In [6]:
col

['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10']

In [7]:
df = pd.DataFrame(data = array_2d, index=ind, columns=col)

In [8]:
df # select * from df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


**df** is our first dataframe. <br>
We have columns, c1 to c10, and their corresponding rows, r1 to r10. <br>
Each column is actually a pandas series, sharing a common index (row labels). <br>

&#9758; Let's learn how to **grab the data** that we need. This is the most important thing to learn before we move on!<br>

### Columns

In [9]:
df[['c1', 'c2']]

Unnamed: 0,c1,c2
r1,0,1
r2,10,11
r3,20,21
r4,30,31
r5,40,41
r6,50,51
r7,60,61
r8,70,71
r9,80,81
r10,90,91


In [10]:
df[['c1', 'c2', 'c5']] #select c1, c2, c5 from df

Unnamed: 0,c1,c2,c5
r1,0,1,4
r2,10,11,14
r3,20,21,24
r4,30,31,34
r5,40,41,44
r6,50,51,54
r7,60,61,64
r8,70,71,74
r9,80,81,84
r10,90,91,94


In [11]:
type(df['c10'])
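
As a small sketch of what the `type()` check above is getting at: single brackets with one label return a Series, while a list of labels returns a DataFrame, even when the list has only one entry. The 3x3 frame `small` is a reduced stand-in for `df`, assumed here for brevity.

```python
import numpy as np
import pandas as pd

# Reduced stand-in for the tutorial frame (3x3 instead of 10x10)
small = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['r1', 'r2', 'r3'],
                     columns=['c1', 'c2', 'c3'])

one_col = small['c1']       # single label -> Series
one_col_df = small[['c1']]  # list with one label -> DataFrame

print(type(one_col))
print(type(one_col_df))
print(one_col_df.shape)     # (3, 1)
```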

In [12]:
# Grabbing more than one column, pass the list of columns you need!
df[['c1', 'c10', 'c3']] #select c1, c10, c3 from df

Unnamed: 0,c1,c10,c3
r1,0,9,2
r2,10,19,12
r3,20,29,22
r4,30,39,32
r5,40,49,42
r6,50,59,52
r7,60,69,62
r8,70,79,72
r9,80,89,82
r10,90,99,92


**df.column_name (e.g. df.c1, df.c2 etc.)** can be used to grab a column as well. It's good to know, but I don't recommend it. <br>
If you press "TAB" after df., you will see lots of available methods; using df['column_name'] keeps column names from being confused with these methods.<br>
**Let's try this once**

In [13]:
df.c5
#df['c5']

Unnamed: 0,c5
r1,4
r2,14
r3,24
r4,34
r5,44
r6,54
r7,64
r8,74
r9,84
r10,94
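
One way to see why bracket access is the safer habit: a column whose name matches an existing DataFrame attribute, or contains a space, cannot be reached with the dot syntax. The frame `demo` and its column names are hypothetical, chosen only to trigger the clash.

```python
import pandas as pd

# Hypothetical frame whose column names clash with the dot syntax
demo = pd.DataFrame({'count': [1, 2, 3], 'my col': [4, 5, 6]})

print(callable(demo.count))     # True: this is the DataFrame.count method, not the column
print(demo['count'].tolist())   # [1, 2, 3]: brackets always return the column
print(demo['my col'].tolist())  # [4, 5, 6]: names with spaces only work with brackets
```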


### Adding new column
Let's try with the "+" operation!

In [14]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [15]:
df['new']= df['c1'] + df['c2']  # select *, (c1 + c2) as new from df

In [16]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,new
r1,0,1,2,3,4,5,6,7,8,9,1
r2,10,11,12,13,14,15,16,17,18,19,21
r3,20,21,22,23,24,25,26,27,28,29,41
r4,30,31,32,33,34,35,36,37,38,39,61
r5,40,41,42,43,44,45,46,47,48,49,81
r6,50,51,52,53,54,55,56,57,58,59,101
r7,60,61,62,63,64,65,66,67,68,69,121
r8,70,71,72,73,74,75,76,77,78,79,141
r9,80,81,82,83,84,85,86,87,88,89,161
r10,90,91,92,93,94,95,96,97,98,99,181


In [17]:
df['new'] = 'A'
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,new
r1,0,1,2,3,4,5,6,7,8,9,A
r2,10,11,12,13,14,15,16,17,18,19,A
r3,20,21,22,23,24,25,26,27,28,29,A
r4,30,31,32,33,34,35,36,37,38,39,A
r5,40,41,42,43,44,45,46,47,48,49,A
r6,50,51,52,53,54,55,56,57,58,59,A
r7,60,61,62,63,64,65,66,67,68,69,A
r8,70,71,72,73,74,75,76,77,78,79,A
r9,80,81,82,83,84,85,86,87,88,89,A
r10,90,91,92,93,94,95,96,97,98,99,A


In [18]:
df['new2'] = 'B'

In [19]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,new,new2
r1,0,1,2,3,4,5,6,7,8,9,A,B
r2,10,11,12,13,14,15,16,17,18,19,A,B
r3,20,21,22,23,24,25,26,27,28,29,A,B
r4,30,31,32,33,34,35,36,37,38,39,A,B
r5,40,41,42,43,44,45,46,47,48,49,A,B
r6,50,51,52,53,54,55,56,57,58,59,A,B
r7,60,61,62,63,64,65,66,67,68,69,A,B
r8,70,71,72,73,74,75,76,77,78,79,A,B
r9,80,81,82,83,84,85,86,87,88,89,A,B
r10,90,91,92,93,94,95,96,97,98,99,A,B


In [20]:
new_col = 'D'

df.insert(loc = 4, column = 'C33', value = new_col)

df

Unnamed: 0,c1,c2,c3,c4,C33,c5,c6,c7,c8,c9,c10,new,new2
r1,0,1,2,3,D,4,5,6,7,8,9,A,B
r2,10,11,12,13,D,14,15,16,17,18,19,A,B
r3,20,21,22,23,D,24,25,26,27,28,29,A,B
r4,30,31,32,33,D,34,35,36,37,38,39,A,B
r5,40,41,42,43,D,44,45,46,47,48,49,A,B
r6,50,51,52,53,D,54,55,56,57,58,59,A,B
r7,60,61,62,63,D,64,65,66,67,68,69,A,B
r8,70,71,72,73,D,74,75,76,77,78,79,A,B
r9,80,81,82,83,D,84,85,86,87,88,89,A,B
r10,90,91,92,93,D,94,95,96,97,98,99,A,B
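
Besides direct assignment and `insert()`, which both modify `df`, pandas also offers `assign()`, which returns a new frame. The sketch below uses a small stand-in frame `small`; it is not part of the original notebook.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=['r1', 'r2'], columns=['c1', 'c2', 'c3'])

# assign() returns a new frame and leaves `small` untouched
with_sum = small.assign(new=small['c1'] + small['c2'])

print(with_sum)
print('new' in small.columns)  # False: the original has no 'new' column
```

This can be handy when you want to try out a derived column without changing the original data.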


### Deleting the column -- `drop()`

        *df.drop('new')-- ValueError: labels ['new'] not contained in axis

Press Shift+Tab and you will see that the default is axis=0, which refers to the index (row labels); for a column, we need to specify axis=1.<br>
&#9758; Rows refer to axis 0 and columns refer to axis 1.<br>
&#9758; Quick check: *df.shape gives a tuple (rows, cols), i.e. rows at [0] and columns at [1]*
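
A minimal sketch of the axis rule, using a small stand-in frame `small`. The exact exception type raised for a missing label depends on the pandas version, so the sketch catches both `KeyError` and `ValueError`. The `columns=` keyword is an alternative spelling of `axis=1` in reasonably recent pandas versions.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=['r1', 'r2'], columns=['c1', 'c2', 'c3'])

print(small.shape)  # (2, 3): rows at position 0, columns at position 1

# Without axis=1, drop() looks for 'c1' among the row labels and fails
try:
    small.drop('c1')
except (KeyError, ValueError) as err:  # type depends on the pandas version
    print(type(err).__name__, err)

# Two equivalent ways to drop a column
a = small.drop('c1', axis=1)
b = small.drop(columns='c1')
print(a.equals(b))  # True
```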

In [21]:
df.drop('r1', axis=0)

Unnamed: 0,c1,c2,c3,c4,C33,c5,c6,c7,c8,c9,c10,new,new2
r2,10,11,12,13,D,14,15,16,17,18,19,A,B
r3,20,21,22,23,D,24,25,26,27,28,29,A,B
r4,30,31,32,33,D,34,35,36,37,38,39,A,B
r5,40,41,42,43,D,44,45,46,47,48,49,A,B
r6,50,51,52,53,D,54,55,56,57,58,59,A,B
r7,60,61,62,63,D,64,65,66,67,68,69,A,B
r8,70,71,72,73,D,74,75,76,77,78,79,A,B
r9,80,81,82,83,D,84,85,86,87,88,89,A,B
r10,90,91,92,93,D,94,95,96,97,98,99,A,B


In [22]:
df

Unnamed: 0,c1,c2,c3,c4,C33,c5,c6,c7,c8,c9,c10,new,new2
r1,0,1,2,3,D,4,5,6,7,8,9,A,B
r2,10,11,12,13,D,14,15,16,17,18,19,A,B
r3,20,21,22,23,D,24,25,26,27,28,29,A,B
r4,30,31,32,33,D,34,35,36,37,38,39,A,B
r5,40,41,42,43,D,44,45,46,47,48,49,A,B
r6,50,51,52,53,D,54,55,56,57,58,59,A,B
r7,60,61,62,63,D,64,65,66,67,68,69,A,B
r8,70,71,72,73,D,74,75,76,77,78,79,A,B
r9,80,81,82,83,D,84,85,86,87,88,89,A,B
r10,90,91,92,93,D,94,95,96,97,98,99,A,B


In [23]:
df.drop('C33', axis=1)

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,new,new2
r1,0,1,2,3,4,5,6,7,8,9,A,B
r2,10,11,12,13,14,15,16,17,18,19,A,B
r3,20,21,22,23,24,25,26,27,28,29,A,B
r4,30,31,32,33,34,35,36,37,38,39,A,B
r5,40,41,42,43,44,45,46,47,48,49,A,B
r6,50,51,52,53,54,55,56,57,58,59,A,B
r7,60,61,62,63,64,65,66,67,68,69,A,B
r8,70,71,72,73,74,75,76,77,78,79,A,B
r9,80,81,82,83,84,85,86,87,88,89,A,B
r10,90,91,92,93,94,95,96,97,98,99,A,B


In [24]:
df

Unnamed: 0,c1,c2,c3,c4,C33,c5,c6,c7,c8,c9,c10,new,new2
r1,0,1,2,3,D,4,5,6,7,8,9,A,B
r2,10,11,12,13,D,14,15,16,17,18,19,A,B
r3,20,21,22,23,D,24,25,26,27,28,29,A,B
r4,30,31,32,33,D,34,35,36,37,38,39,A,B
r5,40,41,42,43,D,44,45,46,47,48,49,A,B
r6,50,51,52,53,D,54,55,56,57,58,59,A,B
r7,60,61,62,63,D,64,65,66,67,68,69,A,B
r8,70,71,72,73,D,74,75,76,77,78,79,A,B
r9,80,81,82,83,D,84,85,86,87,88,89,A,B
r10,90,91,92,93,D,94,95,96,97,98,99,A,B


In [25]:
df.drop('new2', axis=1, inplace=True)

In [26]:
df.drop('new', axis=1, inplace=True)

In [27]:
df.drop('C33', axis=1, inplace=True)

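&#9758; Are "new", "new2" and "C33" really deleted? <br>
Output df and you will see that they are gone, because we passed inplace=True. Without it (as in In [23]), the column was still there!<br>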

In [28]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [29]:
df.loc['R11'] = [90,91,92,93,94,95,96,97,98,99]

In [30]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99
R11,90,91,92,93,94,95,96,97,98,99


In [31]:
df.drop('R11', axis=0, inplace=True)

In [32]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


By default, `drop()` returns a new dataframe and leaves the original unchanged. To delete the column from **df** itself, you have to tell pandas by setting<br>
* ***inplace = True*** (default is inplace=False).<br>

&#9989; *pandas is cautious: it does not want us to lose information by mistake, so a permanent change needs inplace=True*
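
As a small sketch of the same point, reassigning the result of `drop()` has the same effect as `inplace=True`. The frame `demo` is a made-up two-row example.

```python
import pandas as pd

demo = pd.DataFrame({'c1': [0, 10], 'new': ['A', 'A']}, index=['r1', 'r2'])

dropped = demo.drop('new', axis=1)  # returns a new DataFrame
print(list(demo.columns))           # ['c1', 'new']: original unchanged
print(list(dropped.columns))        # ['c1']

demo = demo.drop('new', axis=1)     # reassigning replaces the original
print(list(demo.columns))           # ['c1']
```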

### Rows
We can retrieve a row by its name or position with **[`loc`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)** and **[`iloc`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html)**.<br>
**loc** -- Access a group of rows and columns by label(s)

In [33]:
df.loc[['r2','r3', 'r5']]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r5,40,41,42,43,44,45,46,47,48,49


We can also use a row's integer position with **iloc**, even though our index is labeled. The next two cells return the same row.

In [34]:
df.loc[['r2']]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r2,10,11,12,13,14,15,16,17,18,19


In [35]:
df.iloc[[1]] # iloc[index], index based location

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r2,10,11,12,13,14,15,16,17,18,19


In [36]:
# more than one row -- pass a list of rows!
df.loc[['r1','r2', 'r3']]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
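
The difference between label-based and position-based access can be sketched on a small stand-in frame (`small` is assumed here, not taken from the notebook). Note the slicing behaviour: `loc` includes the end label, while `iloc` excludes the end position, like regular Python slicing.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])

by_label = small.loc['r2']    # row labelled 'r2' -> Series
by_position = small.iloc[1]   # second row (positions start at 0) -> same Series
print(by_label.equals(by_position))  # True

# Slices differ: loc includes the end label, iloc excludes the end position
print(small.loc['r1':'r2'].shape)  # (2, 3)
print(small.iloc[0:1].shape)       # (1, 3)
```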


### Grabbing an element or a sub-set of the dataframe

In [37]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [38]:
df.loc['r1','c1']

0

In [39]:
df.loc[['r1','r2'],['c1','c2']]

Unnamed: 0,c1,c2
r1,0,1
r2,10,11


In [40]:
# another example - random columns and rows in the list
df.loc[['r2','r5'],['c3','c4']]

Unnamed: 0,c3,c4
r2,12,13
r5,42,43


In [41]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [42]:
# We can do a conditional selection as well
df > 5

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,False,False,False,False,False,False,True,True,True,True
r2,True,True,True,True,True,True,True,True,True,True
r3,True,True,True,True,True,True,True,True,True,True
r4,True,True,True,True,True,True,True,True,True,True
r5,True,True,True,True,True,True,True,True,True,True
r6,True,True,True,True,True,True,True,True,True,True
r7,True,True,True,True,True,True,True,True,True,True
r8,True,True,True,True,True,True,True,True,True,True
r9,True,True,True,True,True,True,True,True,True,True
r10,True,True,True,True,True,True,True,True,True,True


In [43]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [44]:
bool_mask = df % 3 == 0
#bool_mask
df[bool_mask]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0.0,,,3.0,,,6.0,,,9.0
r2,,,12.0,,,15.0,,,18.0,
r3,,21.0,,,24.0,,,27.0,,
r4,30.0,,,33.0,,,36.0,,,39.0
r5,,,42.0,,,45.0,,,48.0,
r6,,51.0,,,54.0,,,57.0,,
r7,60.0,,,63.0,,,66.0,,,69.0
r8,,,72.0,,,75.0,,,78.0,
r9,,81.0,,,84.0,,,87.0,,
r10,90.0,,,93.0,,,96.0,,,99.0
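
The NaN pattern above can be reproduced on a smaller frame. This sketch assumes a 3x4 stand-in called `small`. It also shows why the integers turned into floats (NaN is a float value) and that `where()` is the explicit spelling of the same operation.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(12).reshape(3, 4),
                     index=['r1', 'r2', 'r3'],
                     columns=['c1', 'c2', 'c3', 'c4'])

mask = small % 3 == 0
masked = small[mask]  # positions where the mask is False become NaN

print(masked)
print(masked.dtypes)  # columns holding NaN are float64, not int64

# where() does the same thing explicitly
print(masked.equals(small.where(mask)))  # True
```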


&#9758; It's not common to use such an operation on the entire dataframe. We usually apply it to a column or a row instead.<br>
**For example, we don't want rows with NaN values.**<br>
What to do?<br>
Let's have a look at one example.

In [45]:
df  # next: select * from df where c1 > 11

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [46]:
df[df['c1']>11]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [47]:
#select c1, c2 from df where c1 > 11
df[df['c1'] > 11][['c1', 'c2']]

Unnamed: 0,c1,c2
r3,20,21
r4,30,31
r5,40,41
r6,50,51
r7,60,61
r8,70,71
r9,80,81
r10,90,91


Rows `r1` and `r2` do not satisfy the condition, so they are dropped entirely instead of coming back as NaN (null) values. <br>
Let's filter the rows based on a condition on column values and pick other columns.

In [48]:
df[df['c1']>11][['c3','c10']]

Unnamed: 0,c3,c10
r3,22,29
r4,32,39
r5,42,49
r6,52,59
r7,62,69
r8,72,79
r9,82,89
r10,92,99


&#9758; The above, **"`df[df['c1']>11]`"**, is a dataframe with the condition applied; we can select any column from this dataframe, or tighten the condition.<br> For example:

In [49]:
result = df[(df['c1']>11) & (df['c1']<80)]
result

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79


We can do the above operations (filtering and selecting columns) in a single line (stacked commands).
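
As an alternative to stacking two sets of brackets, `loc` accepts a boolean mask for the rows and a list of labels for the columns in one call. The sketch below uses a small made-up frame `small` to show that both spellings give the same result.

```python
import pandas as pd

small = pd.DataFrame({'c1': [0, 10, 20, 30],
                      'c2': [1, 11, 21, 31],
                      'c9': [8, 18, 28, 38]},
                     index=['r1', 'r2', 'r3', 'r4'])

# select c1, c9 from small where c1 > 11
chained = small[small['c1'] > 11][['c1', 'c9']]     # filter, then select
single = small.loc[small['c1'] > 11, ['c1', 'c9']]  # both steps in one loc call

print(single)
print(chained.equals(single))  # True
```

The single `loc` call is also the form to prefer when assigning values to the selection, since assigning through stacked brackets may act on a temporary copy.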


In [50]:
df[df['c1']>11] # select * from df where c1 > 11

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [51]:
df[df['c1']>11].loc[['r3','r4']]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39


In [52]:
result = df[df['c1']==70] # select * from df where c1 = 70
result

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r8,70,71,72,73,74,75,76,77,78,79


### Combine 2 conditions
Let's try on c1 for a value > 60 and on c2 for a value > 80

In [53]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [54]:
df[(df['c1']>60) & (df['c2']>80)]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [55]:
df[(df['c1']>60) | (df['c2']>80)]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


&#9989;**NOTE:**<br>
The "and" operator will not work in the above condition; using "and" raises <br>

        *ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

"Ambiguous" here means that Python's "and" only works on single booleans, such as "True and False", and cannot decide the truth value of a whole Series of True/False values. We need to use "&" instead ("|" for or), with each condition wrapped in parentheses.<br>
Try the above code using "and" to see the error. <br>
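
The note above can be checked with a short sketch. The frame `demo` holds three rows with the same values as r6, r8 and r10 in columns c1 and c2.

```python
import pandas as pd

demo = pd.DataFrame({'c1': [50, 70, 90], 'c2': [51, 71, 91]},
                    index=['r6', 'r8', 'r10'])

# Python's `and` needs one True/False value, but each comparison yields a Series
try:
    demo[(demo['c1'] > 60) and (demo['c2'] > 80)]
except ValueError as err:
    print('ValueError:', err)

# `&` and `|` compare element by element; the parentheses are required
# because `&` and `|` bind more tightly than `>`
both = demo[(demo['c1'] > 60) & (demo['c2'] > 80)]
either = demo[(demo['c1'] > 60) | (demo['c2'] > 80)]
print(both.index.tolist())    # ['r10']
print(either.index.tolist())  # ['r8', 'r10']
```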

### Let's have a quick look at a couple of useful methods.
***We will explore more later on in the course!***

**`reset_index()`** and **`set_index()`**<br>
We can reset the index of our dataframe to the default numerical index; pass `inplace = True` to make the change permanent. *The existing index becomes a new column.*

In [56]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [57]:
df.reset_index(inplace = True)

In [58]:
df

Unnamed: 0,index,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
0,r1,0,1,2,3,4,5,6,7,8,9
1,r2,10,11,12,13,14,15,16,17,18,19
2,r3,20,21,22,23,24,25,26,27,28,29
3,r4,30,31,32,33,34,35,36,37,38,39
4,r5,40,41,42,43,44,45,46,47,48,49
5,r6,50,51,52,53,54,55,56,57,58,59
6,r7,60,61,62,63,64,65,66,67,68,69
7,r8,70,71,72,73,74,75,76,77,78,79
8,r9,80,81,82,83,84,85,86,87,88,89
9,r10,90,91,92,93,94,95,96,97,98,99


In [59]:
df.set_index('c2', inplace = True)
df

c2,index,c1,c3,c4,c5,c6,c7,c8,c9,c10
1,r1,0,2,3,4,5,6,7,8,9
11,r2,10,12,13,14,15,16,17,18,19
21,r3,20,22,23,24,25,26,27,28,29
31,r4,30,32,33,34,35,36,37,38,39
41,r5,40,42,43,44,45,46,47,48,49
51,r6,50,52,53,54,55,56,57,58,59
61,r7,60,62,63,64,65,66,67,68,69
71,r8,70,72,73,74,75,76,77,78,79
81,r9,80,82,83,84,85,86,87,88,89
91,r10,90,92,93,94,95,96,97,98,99


**Consider: we have a column in our data that could be a useful index,<br>
and we want to set that column as the index!**<br>

In [60]:
array_2d

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])

In [61]:
col

['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10']

In [62]:
df = pd.DataFrame(data = array_2d, index = ind, columns = col)
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [63]:
abc = 'a b c d e f g h i j'.split() # split at white spaces
df['newind']=abc
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,newind
r1,0,1,2,3,4,5,6,7,8,9,a
r2,10,11,12,13,14,15,16,17,18,19,b
r3,20,21,22,23,24,25,26,27,28,29,c
r4,30,31,32,33,34,35,36,37,38,39,d
r5,40,41,42,43,44,45,46,47,48,49,e
r6,50,51,52,53,54,55,56,57,58,59,f
r7,60,61,62,63,64,65,66,67,68,69,g
r8,70,71,72,73,74,75,76,77,78,79,h
r9,80,81,82,83,84,85,86,87,88,89,i
r10,90,91,92,93,94,95,96,97,98,99,j


In [64]:
# set newind as the index; inplace=True changes df itself
df.set_index('newind', inplace = True)

In [65]:
df

newind,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
a,0,1,2,3,4,5,6,7,8,9
b,10,11,12,13,14,15,16,17,18,19
c,20,21,22,23,24,25,26,27,28,29
d,30,31,32,33,34,35,36,37,38,39
e,40,41,42,43,44,45,46,47,48,49
f,50,51,52,53,54,55,56,57,58,59
g,60,61,62,63,64,65,66,67,68,69
h,70,71,72,73,74,75,76,77,78,79
i,80,81,82,83,84,85,86,87,88,89
j,90,91,92,93,94,95,96,97,98,99
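
Both methods also work without `inplace=True`, in which case they return a new frame. The sketch below, on a made-up two-row frame `demo`, also shows `drop=True`, which discards the old index instead of turning it into a column.

```python
import pandas as pd

demo = pd.DataFrame({'c1': [0, 10], 'c2': [1, 11]}, index=['r1', 'r2'])

kept = demo.reset_index()                # old index becomes a column named 'index'
discarded = demo.reset_index(drop=True)  # old index is thrown away
print(list(kept.columns))       # ['index', 'c1', 'c2']
print(list(discarded.columns))  # ['c1', 'c2']

# set_index() returns a new frame unless inplace=True is passed
by_c2 = demo.set_index('c2')
print(by_c2.index.tolist())     # [1, 11]
print(list(demo.columns))       # ['c1', 'c2']: demo itself is unchanged
```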


### `head()`, `tail()`

In [66]:
# Returns first n rows
df.head() # n = 5 by default

newind,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
a,0,1,2,3,4,5,6,7,8,9
b,10,11,12,13,14,15,16,17,18,19
c,20,21,22,23,24,25,26,27,28,29
d,30,31,32,33,34,35,36,37,38,39
e,40,41,42,43,44,45,46,47,48,49


In [67]:
# Returns last n rows
df.tail(2) # n = 5 by default

newind,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
i,80,81,82,83,84,85,86,87,88,89
j,90,91,92,93,94,95,96,97,98,99


### `info()`
Provides a concise summary of the DataFrame.

In [68]:
df

newind,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
a,0,1,2,3,4,5,6,7,8,9
b,10,11,12,13,14,15,16,17,18,19
c,20,21,22,23,24,25,26,27,28,29
d,30,31,32,33,34,35,36,37,38,39
e,40,41,42,43,44,45,46,47,48,49
f,50,51,52,53,54,55,56,57,58,59
g,60,61,62,63,64,65,66,67,68,69
h,70,71,72,73,74,75,76,77,78,79
i,80,81,82,83,84,85,86,87,88,89
j,90,91,92,93,94,95,96,97,98,99


In [69]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   c1      10 non-null     int64
 1   c2      10 non-null     int64
 2   c3      10 non-null     int64
 3   c4      10 non-null     int64
 4   c5      10 non-null     int64
 5   c6      10 non-null     int64
 6   c7      10 non-null     int64
 7   c8      10 non-null     int64
 8   c9      10 non-null     int64
 9   c10     10 non-null     int64
dtypes: int64(10)
memory usage: 880.0+ bytes


### `describe()`
Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding `NaN` values.

In [70]:
df

newind,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
a,0,1,2,3,4,5,6,7,8,9
b,10,11,12,13,14,15,16,17,18,19
c,20,21,22,23,24,25,26,27,28,29
d,30,31,32,33,34,35,36,37,38,39
e,40,41,42,43,44,45,46,47,48,49
f,50,51,52,53,54,55,56,57,58,59
g,60,61,62,63,64,65,66,67,68,69
h,70,71,72,73,74,75,76,77,78,79
i,80,81,82,83,84,85,86,87,88,89
j,90,91,92,93,94,95,96,97,98,99


In [71]:
df.describe()

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,45.0,46.0,47.0,48.0,49.0,50.0,51.0,52.0,53.0,54.0
std,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504
min,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
25%,22.5,23.5,24.5,25.5,26.5,27.5,28.5,29.5,30.5,31.5
50%,45.0,46.0,47.0,48.0,49.0,50.0,51.0,52.0,53.0,54.0
75%,67.5,68.5,69.5,70.5,71.5,72.5,73.5,74.5,75.5,76.5
max,90.0,91.0,92.0,93.0,94.0,95.0,96.0,97.0,98.0,99.0
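
To connect the numbers above to their definitions, this sketch recomputes two of them for a Series with the same values as column c1. It shows that the `std` row is the sample standard deviation (`ddof=1`) and that the `25%` row matches `quantile(0.25)`.

```python
import numpy as np
import pandas as pd

c1 = pd.Series(np.arange(0, 100, 10))  # 0, 10, ..., 90: same values as column c1

summary = c1.describe()
print(summary['std'])    # about 30.2765, as in the table above
print(c1.std(ddof=1))    # sample standard deviation: same value
print(c1.std(ddof=0))    # population standard deviation: slightly smaller
print(summary['25%'], c1.quantile(0.25))  # both 22.5
```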


# Excellent!
Congratulations, you are making great progress. Keep it up!

In [72]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [73]:
location = "/content/drive/MyDrive/Business Automation Ltd./2. Code/Machine Learning/Topic 2: Data Analysis using - Pandas/"

In [74]:
df2 = pd.read_csv(location + 'Breast_Cancer_Diagnostic.csv')

In [75]:
df2

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,


## nrows

In [76]:
df2 = pd.read_csv(location + 'Breast_Cancer_Diagnostic.csv', nrows=100)
df2

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.990,10.38,122.80,1001.0,0.11840,0.27760,0.300100,0.147100,...,17.33,184.60,2019.0,0.1622,0.66560,0.71190,0.26540,0.4601,0.11890,
1,842517,M,20.570,17.77,132.90,1326.0,0.08474,0.07864,0.086900,0.070170,...,23.41,158.80,1956.0,0.1238,0.18660,0.24160,0.18600,0.2750,0.08902,
2,84300903,M,19.690,21.25,130.00,1203.0,0.10960,0.15990,0.197400,0.127900,...,25.53,152.50,1709.0,0.1444,0.42450,0.45040,0.24300,0.3613,0.08758,
3,84348301,M,11.420,20.38,77.58,386.1,0.14250,0.28390,0.241400,0.105200,...,26.50,98.87,567.7,0.2098,0.86630,0.68690,0.25750,0.6638,0.17300,
4,84358402,M,20.290,14.34,135.10,1297.0,0.10030,0.13280,0.198000,0.104300,...,16.67,152.20,1575.0,0.1374,0.20500,0.40000,0.16250,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,86208,M,20.260,23.03,132.40,1264.0,0.09078,0.13130,0.146500,0.086830,...,31.59,156.10,1750.0,0.1190,0.35390,0.40980,0.15730,0.3689,0.08368,
96,86211,B,12.180,17.84,77.79,451.1,0.10450,0.07057,0.024900,0.029410,...,20.92,82.14,495.2,0.1140,0.09358,0.04980,0.05882,0.2227,0.07376,
97,862261,B,9.787,19.94,62.11,294.5,0.10240,0.05301,0.006829,0.007937,...,26.29,68.81,366.1,0.1316,0.09473,0.02049,0.02381,0.1934,0.08988,
98,862485,B,11.600,12.84,74.34,412.6,0.08983,0.07525,0.041960,0.033500,...,17.16,82.96,512.5,0.1431,0.18510,0.19220,0.08449,0.2772,0.08756,
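
The effect of `nrows` can be tried without the file on Drive. The sketch below reads from an in-memory string instead; `csv_text` and its four rows are made up for illustration and only borrow three column names from the dataset.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk
csv_text = (
    "id,diagnosis,radius_mean\n"
    "1,M,17.99\n"
    "2,M,20.57\n"
    "3,B,12.18\n"
    "4,B,9.787\n"
)

# nrows limits how many data rows are read (the header line is not counted)
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(first_two)
print(first_two.shape)  # (2, 3)
```

Reading only the first rows is a quick way to inspect the columns of a large file before loading all of it.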


In [77]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       100 non-null    int64  
 1   diagnosis                100 non-null    object 
 2   radius_mean              100 non-null    float64
 3   texture_mean             100 non-null    float64
 4   perimeter_mean           100 non-null    float64
 5   area_mean                100 non-null    float64
 6   smoothness_mean          100 non-null    float64
 7   compactness_mean         100 non-null    float64
 8   concavity_mean           100 non-null    float64
 9   concave points_mean      100 non-null    float64
 10  symmetry_mean            100 non-null    float64
 11  fractal_dimension_mean   100 non-null    float64
 12  radius_se                100 non-null    float64
 13  texture_se               100 non-null    float64
 14  perimeter_se             10

In [78]:
df2.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 33 entries, id to Unnamed: 32
dtypes: float64(31), int64(1), object(1)
memory usage: 25.9+ KB


In [79]:
df2.describe(include = 'all')

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
count,100.0,100,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,0.0
unique,,2,,,,,,,,,...,,,,,,,,,,
top,,M,,,,,,,,,...,,,,,,,,,,
freq,,65,,,,,,,,,...,,,,,,,,,,
mean,15470930.0,,14.70778,19.6922,96.4712,703.293,0.10204,0.126576,0.114752,0.063801,...,26.5999,116.6103,1008.86,0.142288,0.330832,0.356674,0.145522,0.323583,0.09217,
std,30665490.0,,3.349245,3.759176,23.187471,320.152301,0.013151,0.061123,0.078862,0.038104,...,5.701383,33.701712,546.008605,0.02294,0.204848,0.237003,0.068264,0.080613,0.023814,
min,85715.0,,8.196,10.38,51.71,201.9,0.07355,0.03766,0.000692,0.004167,...,12.49,57.26,242.2,0.09387,0.04619,0.001845,0.01111,0.1565,0.05504,
25%,854264.2,,12.4575,16.76,82.27,476.8,0.093402,0.080118,0.041847,0.02917,...,22.44,91.435,609.7,0.12705,0.177125,0.1724,0.079223,0.2716,0.076135,
50%,859373.5,,14.335,20.19,94.365,643.65,0.10115,0.11815,0.10715,0.06581,...,27.26,110.45,830.75,0.1431,0.2794,0.3156,0.15605,0.3046,0.08757,
75%,8610460.0,,17.155,22.15,114.4,916.875,0.110375,0.1568,0.16635,0.08759,...,30.885,140.25,1323.75,0.1576,0.4248,0.522075,0.190725,0.36935,0.103275,
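
The extra rows (`unique`, `top`, `freq`) come from `include='all'`, which also summarizes non-numeric columns. A minimal sketch on a made-up three-row frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'diagnosis': ['M', 'M', 'B'],
                     'radius_mean': [17.99, 20.57, 12.18]})

numeric_only = demo.describe()       # default: numeric columns only
full = demo.describe(include='all')  # adds unique/top/freq for text columns

print(list(numeric_only.columns))    # ['radius_mean']
print(full)
print(full.loc['top', 'diagnosis'], full.loc['freq', 'diagnosis'])  # M 2
```

Statistics that do not apply to a column's type, such as the mean of `diagnosis`, are shown as NaN.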


In [80]:
df2

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.990,10.38,122.80,1001.0,0.11840,0.27760,0.300100,0.147100,...,17.33,184.60,2019.0,0.1622,0.66560,0.71190,0.26540,0.4601,0.11890,
1,842517,M,20.570,17.77,132.90,1326.0,0.08474,0.07864,0.086900,0.070170,...,23.41,158.80,1956.0,0.1238,0.18660,0.24160,0.18600,0.2750,0.08902,
2,84300903,M,19.690,21.25,130.00,1203.0,0.10960,0.15990,0.197400,0.127900,...,25.53,152.50,1709.0,0.1444,0.42450,0.45040,0.24300,0.3613,0.08758,
3,84348301,M,11.420,20.38,77.58,386.1,0.14250,0.28390,0.241400,0.105200,...,26.50,98.87,567.7,0.2098,0.86630,0.68690,0.25750,0.6638,0.17300,
4,84358402,M,20.290,14.34,135.10,1297.0,0.10030,0.13280,0.198000,0.104300,...,16.67,152.20,1575.0,0.1374,0.20500,0.40000,0.16250,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,86208,M,20.260,23.03,132.40,1264.0,0.09078,0.13130,0.146500,0.086830,...,31.59,156.10,1750.0,0.1190,0.35390,0.40980,0.15730,0.3689,0.08368,
96,86211,B,12.180,17.84,77.79,451.1,0.10450,0.07057,0.024900,0.029410,...,20.92,82.14,495.2,0.1140,0.09358,0.04980,0.05882,0.2227,0.07376,
97,862261,B,9.787,19.94,62.11,294.5,0.10240,0.05301,0.006829,0.007937,...,26.29,68.81,366.1,0.1316,0.09473,0.02049,0.02381,0.1934,0.08988,
98,862485,B,11.600,12.84,74.34,412.6,0.08983,0.07525,0.041960,0.033500,...,17.16,82.96,512.5,0.1431,0.18510,0.19220,0.08449,0.2772,0.08756,


In [81]:
pd.set_option('display.max_rows', None)
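
`pd.set_option('display.max_rows', None)` changes the setting for the whole session, so every DataFrame displayed afterwards is printed in full. If we only want the full view once, `pd.option_context()` is a handy alternative. Here is a minimal sketch (the frame `df_demo` is a hypothetical stand-in for `df2`):

```python
import pandas as pd

# Hypothetical stand-in for df2: 100 rows, one column.
df_demo = pd.DataFrame({"x": range(100)})

before = pd.get_option("display.max_rows")

# Inside the block the row limit is lifted; on exit the old value is restored.
with pd.option_context("display.max_rows", None):
    inside = pd.get_option("display.max_rows")
    full_repr = repr(df_demo)  # text representation with all rows

after = pd.get_option("display.max_rows")
print(before == after)  # True: the setting was only changed temporarily

# To undo an earlier global pd.set_option call, reset to the pandas default.
pd.reset_option("display.max_rows")
```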

In [82]:
df2

,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,
5,843786,M,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,...,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,
6,844359,M,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,...,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368,
7,84458202,M,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,...,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151,
8,844981,M,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,...,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072,
9,84501001,M,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,...,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075,


In [83]:
df2.describe()

,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
count,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,0.0
mean,15470930.0,14.70778,19.6922,96.4712,703.293,0.10204,0.126576,0.114752,0.063801,0.19307,...,26.5999,116.6103,1008.86,0.142288,0.330832,0.356674,0.145522,0.323583,0.09217,
std,30665490.0,3.349245,3.759176,23.187471,320.152301,0.013151,0.061123,0.078862,0.038104,0.030822,...,5.701383,33.701712,546.008605,0.02294,0.204848,0.237003,0.068264,0.080613,0.023814,
min,85715.0,8.196,10.38,51.71,201.9,0.07355,0.03766,0.000692,0.004167,0.135,...,12.49,57.26,242.2,0.09387,0.04619,0.001845,0.01111,0.1565,0.05504,
25%,854264.2,12.4575,16.76,82.27,476.8,0.093402,0.080118,0.041847,0.02917,0.172,...,22.44,91.435,609.7,0.12705,0.177125,0.1724,0.079223,0.2716,0.076135,
50%,859373.5,14.335,20.19,94.365,643.65,0.10115,0.11815,0.10715,0.06581,0.18955,...,27.26,110.45,830.75,0.1431,0.2794,0.3156,0.15605,0.3046,0.08757,
75%,8610460.0,17.155,22.15,114.4,916.875,0.110375,0.1568,0.16635,0.08759,0.208825,...,30.885,140.25,1323.75,0.1576,0.4248,0.522075,0.190725,0.36935,0.103275,
max,86135500.0,25.22,27.54,171.5,1878.0,0.1425,0.3454,0.3754,0.1845,0.304,...,40.68,211.7,2615.0,0.2098,1.058,1.252,0.2867,0.6638,0.2075,
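
Notice that `Unnamed: 32` has a `count` of 0.0, which means the column contains nothing but missing values. The notebook does not show the raw file, so the cause is an assumption here, but a trailing comma at the end of every line in a CSV file commonly produces such a column. Here is a minimal sketch with made-up CSV text (`csv_text`, `demo` and `cleaned` are illustrative names):

```python
import io
import pandas as pd

# Hypothetical CSV text: every line ends with a trailing comma,
# so the parser sees one extra, empty field per row.
csv_text = "id,radius_mean,\n842302,17.99,\n842517,20.57,\n"

demo = pd.read_csv(io.StringIO(csv_text))
print(demo.columns.tolist())         # ['id', 'radius_mean', 'Unnamed: 2']
print(demo.describe().loc["count"])  # count is 0 for the empty column

# Drop every column whose values are all NaN.
cleaned = demo.dropna(axis=1, how="all")
print(cleaned.columns.tolist())      # ['id', 'radius_mean']
```

The same `dropna(axis=1, how='all')` call could be applied to `df2` if the empty column is not needed.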
