# Pandas tutorial : Day 3

Selecting, Slicing and Filtering data in a Pandas DataFrame

* [Selecting](#1)
 1. [Select rows and columns using labels](#2)
   * [To select a single column](#3)
   * [To select multiple columns](#4)
   * [Select a row by its label](#5)
   * [Select multiple rows by their labels](#6)
   * [Accessing values by row label and column name](#7)
   * [Accessing values from multiple columns of same row](#8)
   * [Accessing values from multiple rows but same columns](#9)
   * [Accessing values from multiple rows and multiple columns](#10)
 2. [Select by index position](#11)
   * [Select a row by index location](#12)
   * [Select a column by index location](#13)
   * [Select data at specified row and column location](#14)
   * [Select multiple rows and columns](#15)
 3. [Selecting top n largest values of given column](#16)
 4. [Selecting top n smallest values of given column](#17)
 5. [Selecting random sample from the dataset](#18)
 6. [Conditional selection of columns](#19)
* [Slicing](#20)
 1. [Slicing rows and columns using labels.](#21)
   * [Slice row by label](#22)
   * [Slice columns by label](#23)
   * [Slice row and columns by label](#24)
 2. [Slicing rows and columns by position.](#25)
   * [To slice rows by index position](#26)
   * [To slice columns by index position](#27)
   * [To slice row and columns by index position](#28)
* [Subsetting by boolean conditions](#29)
 1. [Select rows based on column value](#30)
   * [To select all rows whose column contains the specified value(s)](#31)
   * [Rows that match multiple column conditions](#32)
   * [Select rows whose column DOES NOT contain specified values](#33)
 2. [Select columns based on row value](#34)
 3. [Subsetting using filter method](#35)
 
 
Let's get started!
 
[Data for daily news for stock market prediction](https://www.kaggle.com/aaron7sun/stocknews)

In [1]:
# import library
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/stocknews/Combined_News_DJIA.csv
/kaggle/input/stocknews/upload_DJIA_table.csv
/kaggle/input/stocknews/RedditNews.csv


In [2]:
# import data
df = pd.read_csv('/kaggle/input/stocknews/upload_DJIA_table.csv')

In [3]:
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2016-07-01,17924.240234,18002.380859,17916.910156,17949.369141,82160000,17949.369141
1,2016-06-30,17712.759766,17930.609375,17711.800781,17929.990234,133030000,17929.990234
2,2016-06-29,17456.019531,17704.509766,17456.019531,17694.679688,106380000,17694.679688
3,2016-06-28,17190.509766,17409.720703,17190.509766,17409.720703,112190000,17409.720703
4,2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234


# Selecting<a id='1'></a>

## 1. Select rows and columns using labels<a id='2'></a>
You can select rows and columns in a Pandas DataFrame by using their corresponding labels.

### 1.1 To select a single column<a id='3'></a>
Syntax 1 : `df.loc[:, 'column_name']`

Syntax 2 : `df['column_name']`

Syntax 3 : `df.column_name`

In [31]:
df.loc[:, 'Open']

0       17924.240234
1       17712.759766
2       17456.019531
3       17190.509766
4       17355.210938
            ...     
1984    11532.070312
1985    11632.809570
1986    11781.700195
1987    11729.669922
1988    11432.089844
Name: Open, Length: 1989, dtype: float64

In [30]:
df['Open']

0       17924.240234
1       17712.759766
2       17456.019531
3       17190.509766
4       17355.210938
            ...     
1984    11532.070312
1985    11632.809570
1986    11781.700195
1987    11729.669922
1988    11432.089844
Name: Open, Length: 1989, dtype: float64

In [29]:
df.Open

0       17924.240234
1       17712.759766
2       17456.019531
3       17190.509766
4       17355.210938
            ...     
1984    11532.070312
1985    11632.809570
1986    11781.700195
1987    11729.669922
1988    11432.089844
Name: Open, Length: 1989, dtype: float64
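
The three forms above return the same Series, but attribute access only works when the column name is a valid Python identifier. The following is a minimal sketch of that difference, assuming a small hand-built frame named `sample` (not part of the original notebook) that copies the first rows printed by `df.head()`.

```python
import pandas as pd

# Hypothetical stand-in for df: the first three rows shown by df.head().
sample = pd.DataFrame({
    "Date": ["2016-07-01", "2016-06-30", "2016-06-29"],
    "Open": [17924.240234, 17712.759766, 17456.019531],
    "Adj Close": [17949.369141, 17929.990234, 17694.679688],
})

# For a simple column name, all three forms return the same Series.
by_loc = sample.loc[:, "Open"]
by_brackets = sample["Open"]
by_attribute = sample.Open
same = by_loc.equals(by_brackets) and by_brackets.equals(by_attribute)

# "Adj Close" contains a space, so it cannot be written as an attribute;
# bracket or .loc access is needed for such names.
adj_close = sample["Adj Close"]
```

Bracket access is usually the safer default, since it behaves the same for every column name.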

### 1.2 To select multiple columns<a id='4'></a>
Syntax 1 : `df.loc[:, ['column1', 'column2', ...]]`

Syntax 2 : `df[['column1', 'column2', ...]]`

In [28]:
df.loc[:, ['Open', 'Close']]

Unnamed: 0,Open,Close
0,17924.240234,17949.369141
1,17712.759766,17929.990234
2,17456.019531,17694.679688
3,17190.509766,17409.720703
4,17355.210938,17140.240234
...,...,...
1984,11532.070312,11615.929688
1985,11632.809570,11532.959961
1986,11781.700195,11642.469727
1987,11729.669922,11782.349609


In [27]:
df[['Open', 'Close']]

Unnamed: 0,Open,Close
0,17924.240234,17949.369141
1,17712.759766,17929.990234
2,17456.019531,17694.679688
3,17190.509766,17409.720703
4,17355.210938,17140.240234
...,...,...
1984,11532.070312,11615.929688
1985,11632.809570,11532.959961
1986,11781.700195,11642.469727
1987,11729.669922,11782.349609


### 1.3 Select a row by its label<a id='5'></a>
Syntax : `df.loc[row_label]`

In [9]:
df.loc[0]

Date         2016-07-01
Open            17924.2
High            18002.4
Low             17916.9
Close           17949.4
Volume         82160000
Adj Close       17949.4
Name: 0, dtype: object

### 1.4 Select multiple rows by their labels<a id='6'></a>
Syntax : `df.loc[[row_label1, row_label2, ...]]`

In [10]:
df.loc[[0,1,10]]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2016-07-01,17924.240234,18002.380859,17916.910156,17949.369141,82160000,17949.369141
1,2016-06-30,17712.759766,17930.609375,17711.800781,17929.990234,133030000,17929.990234
10,2016-06-17,17733.439453,17733.439453,17602.779297,17675.160156,248680000,17675.160156


### 1.5 Accessing values by row label and column name<a id='7'></a>
Syntax : `df.loc[row_label, 'column_name']`

In [11]:
df.loc[0, 'Open']

17924.240234

### 1.6 Accessing values from multiple columns of same row<a id='8'></a>
Syntax : `df.loc[row_label, ['column_name1', 'column_name2']]`

In [12]:
df.loc[1, ['Open', 'Close']]

Open     17712.8
Close      17930
Name: 1, dtype: object

### 1.7 Accessing values from multiple rows but same columns<a id='9'></a>
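Syntax 1 : `df.loc[[row_label1, row_label2], 'column_name']` (returns a Series)

Syntax 2 : `df.loc[[row_label1, row_label2], ['column_name']]` (returns a one-column DataFrame, as in the cell below)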

In [13]:
df.loc[[0, 1], ['Open']]

Unnamed: 0,Open
0,17924.240234
1,17712.759766


### 1.8 Accessing values from multiple rows and multiple columns<a id='10'></a>
Syntax : `df.loc[[row_label1, row_label2, ...], ['column_name1', 'column_name2', ...]]`

In [14]:
df.loc[[0, 1], ['Open', 'Close']]

Unnamed: 0,Open,Close
0,17924.240234,17949.369141
1,17712.759766,17929.990234


## 2. Select by index position<a id='11'></a>
You can select data from a Pandas DataFrame by its integer position using `iloc`. Note that positions start from zero.

### 2.1 Select a row by index location<a id='12'></a>
Syntax : `df.iloc[index]`

In [15]:
df.iloc[0]

Date         2016-07-01
Open            17924.2
High            18002.4
Low             17916.9
Close           17949.4
Volume         82160000
Adj Close       17949.4
Name: 0, dtype: object

### 2.2 Select a column by index location<a id='13'></a>
Syntax : `df.iloc[:, index]`

In [16]:
df.iloc[:, 5]

0        82160000
1       133030000
2       106380000
3       112190000
4       138740000
          ...    
1984    159790000
1985    182550000
1986    173590000
1987    183190000
1988    212830000
Name: Volume, Length: 1989, dtype: int64

### 2.3 Select data at specified row and column location<a id='14'></a>
Syntax : `df.iloc[row_index, column_index]`

In [17]:
df.iloc[0, 0]

'2016-07-01'

### 2.4 Select multiple rows and columns<a id='15'></a>
Syntax : `df.iloc[[row_index1, row_index2, ...], [column_index1, column_index2, ...]]`

In [18]:
df.iloc[[0, 1, 3], [0, 1]]

Unnamed: 0,Date,Open
0,2016-07-01,17924.240234
1,2016-06-30,17712.759766
3,2016-06-28,17190.509766
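
In this dataset the row labels happen to equal the row positions, so `loc` and `iloc` look interchangeable. The following sketch shows where they diverge, assuming a small hand-built frame `sample` (a stand-in for `df`) whose rows are reordered by sorting.

```python
import pandas as pd

# Hypothetical stand-in for df, using values from df.head().
sample = pd.DataFrame({
    "Date": ["2016-07-01", "2016-06-30", "2016-06-29"],
    "Open": [17924.240234, 17712.759766, 17456.019531],
})

# Sorting by Open reorders the rows but keeps their original labels.
by_open = sample.sort_values("Open")

first_by_position = by_open.iloc[0]  # first row after sorting (original label 2)
row_labelled_0 = by_open.loc[0]      # the row whose label is 0, wherever it now sits
```

After any operation that reorders or filters rows, `iloc[0]` means "the first row" while `loc[0]` means "the row labelled 0".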


## 3. Selecting top n largest values of given column<a id='16'></a>
Syntax : `df.nlargest(n, 'column_name')`

In [66]:
df.nlargest(3,'Open') 

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
282,2015-05-20,18315.060547,18350.130859,18272.560547,18285.400391,80190000,18285.400391
283,2015-05-19,18300.480469,18351.359375,18261.349609,18312.390625,87200000,18312.390625
280,2015-05-22,18286.869141,18286.869141,18217.140625,18232.019531,78890000,18232.019531


## 4. Selecting top n smallest values of given column<a id='17'></a>
Syntax : `df.nsmallest(n, 'column_name')`

In [68]:
df.nsmallest(3,'Open') 

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
1842,2009-03-10,6547.009766,6926.490234,6546.609863,6926.490234,640020000,6926.490234
1844,2009-03-06,6595.160156,6755.169922,6469.950195,6626.939941,425170000,6626.939941
1843,2009-03-09,6625.740234,6709.609863,6516.859863,6547.049805,365990000,6547.049805
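
`nlargest` and `nsmallest` give the same rows as sorting the column and taking the first `n` rows. The sketch below checks that equivalence on a small hand-built frame; `sample` is an assumed stand-in for `df`, built from the rows of `df.head()`.

```python
import pandas as pd

# Hypothetical stand-in for df, using values from df.head().
sample = pd.DataFrame({
    "Date": ["2016-07-01", "2016-06-30", "2016-06-29", "2016-06-28", "2016-06-27"],
    "Open": [17924.240234, 17712.759766, 17456.019531, 17190.509766, 17355.210938],
})

# Two largest Open values, and the sort-based equivalent.
top2 = sample.nlargest(2, "Open")
top2_sorted = sample.sort_values("Open", ascending=False).head(2)

# Two smallest Open values, and the sort-based equivalent.
bottom2 = sample.nsmallest(2, "Open")
bottom2_sorted = sample.sort_values("Open").head(2)
```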


## 5. Selecting random sample from the dataset<a id='18'></a>
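Syntax 1 : `df.sample(n)` returns `n` randomly chosen rows

Syntax 2 : `df.sample(frac = fraction)` returns that fraction of the rows (for example, `0.3` for 30%)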

In [69]:
df.sample(3)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
1532,2010-06-02,10025.610352,10254.219727,10025.610352,10249.540039,200850000,10249.540039
1647,2009-12-15,10499.30957,10499.30957,10426.69043,10452.0,187560000,10452.0
410,2014-11-13,17618.689453,17705.480469,17583.880859,17652.789062,80540000,17652.789062


In [77]:
df.sample(frac = 0.3)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
1914,2008-11-21,7552.370117,8071.750000,7449.379883,8046.419922,569010000,8046.419922
1631,2010-01-08,10606.400391,10619.400391,10554.330078,10618.190430,172710000,10618.190430
885,2012-12-26,13138.849609,13174.879883,13076.870117,13114.589844,79410000,13114.589844
1812,2009-04-22,7964.779785,8044.830078,7868.009766,7886.569824,387030000,7886.569824
465,2014-08-27,17111.029297,17134.599609,17090.609375,17122.009766,61690000,17122.009766
...,...,...,...,...,...,...,...
334,2015-03-06,18135.720703,18135.720703,17825.150391,17856.779297,113350000,17856.779297
1194,2011-10-03,10912.099609,10979.190430,10653.339844,10655.299805,242870000,10655.299805
754,2013-07-05,14995.459961,15137.509766,14971.200195,15135.839844,94560000,15135.839844
1371,2011-01-20,11823.700195,11845.160156,11744.769531,11822.799805,180800000,11822.799805


## 6. Conditional selection of columns<a id='19'></a>
Each syntax below returns the rows whose value in the given column satisfies the condition.
Syntax 1 : `df[df.column_name < value]`

Syntax 2 : `df[df.column_name > value]`

Syntax 3 : `df[df.column_name == value]`

Syntax 4 : `df[df.column_name <= value]`

Syntax 5 : `df[df.column_name >= value]`

In [79]:
df[df.Open >= 18281.949219]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
280,2015-05-22,18286.869141,18286.869141,18217.140625,18232.019531,78890000,18232.019531
281,2015-05-21,18285.869141,18314.890625,18249.900391,18285.740234,84270000,18285.740234
282,2015-05-20,18315.060547,18350.130859,18272.560547,18285.400391,80190000,18285.400391
283,2015-05-19,18300.480469,18351.359375,18261.349609,18312.390625,87200000,18312.390625
337,2015-03-03,18281.949219,18281.949219,18136.880859,18203.369141,83830000,18203.369141
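
A conditional selection works in two steps: the comparison builds a boolean Series (a mask), and indexing with that mask keeps the rows where it is `True`. The sketch below separates the two steps on a hand-built frame; `sample` is an assumed stand-in for `df`, and the threshold 17700 is chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df, using values from df.head().
sample = pd.DataFrame({
    "Date": ["2016-07-01", "2016-06-30", "2016-06-29", "2016-06-28", "2016-06-27"],
    "Open": [17924.240234, 17712.759766, 17456.019531, 17190.509766, 17355.210938],
})

mask = sample["Open"] >= 17700   # step 1: one True/False per row
selected = sample[mask]          # step 2: keep the rows marked True
count = int(mask.sum())          # True counts as 1, so the sum is the number of matches
```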


# Slicing<a id='20'></a>
Slicing is a Python feature for accessing parts of sequences such as strings, tuples, and lists. Slices can also modify or delete items of mutable sequences such as lists, and they work on third-party objects such as NumPy arrays and Pandas Series and DataFrames.

Slicing helps keep code concise and readable.
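
As a quick refresher before applying slices to a DataFrame, here is a small sketch of plain Python slicing on a list and a string. The list simply reuses the column names of the dataset; the replacement values `"O"` and `"H"` are made up for illustration.

```python
columns = ["Date", "Open", "High", "Low", "Close", "Volume", "Adj Close"]

prices = columns[1:5]      # positions 1 to 4; the stop position is excluded
first_two = columns[:2]    # an omitted start means "from the beginning"
every_other = columns[::2] # a step of 2 takes every second item
year = "2016-07-01"[:4]    # slicing works on strings too

# Slices can also modify or delete parts of a mutable list.
editable = list(columns)
editable[1:3] = ["O", "H"]  # replace the items at positions 1 and 2
del editable[-2:]           # remove the last two items
```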

## 1. Slicing rows and columns using labels<a id='21'></a>
You can select a range of rows or columns by label or by position. To slice by label, use the `loc` attribute of the DataFrame. Unlike ordinary Python slices, a label slice includes both the start and the end label.

### 1.1 Slice row by label<a id='22'></a>
Syntax : `df.loc[starting_row_label : ending_row_label, :]`

In [22]:
df.loc[1:5, :]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
1,2016-06-30,17712.759766,17930.609375,17711.800781,17929.990234,133030000,17929.990234
2,2016-06-29,17456.019531,17704.509766,17456.019531,17694.679688,106380000,17694.679688
3,2016-06-28,17190.509766,17409.720703,17190.509766,17409.720703,112190000,17409.720703
4,2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234
5,2016-06-24,17946.630859,17946.630859,17356.339844,17400.75,239000000,17400.75


### 1.2 Slice columns by label<a id='23'></a>
Syntax : `df.loc[:, 'starting_column_name' : 'ending_column_name']`

In [25]:
df.loc[:, 'Open' : 'Close']

Unnamed: 0,Open,High,Low,Close
0,17924.240234,18002.380859,17916.910156,17949.369141
1,17712.759766,17930.609375,17711.800781,17929.990234
2,17456.019531,17704.509766,17456.019531,17694.679688
3,17190.509766,17409.720703,17190.509766,17409.720703
4,17355.210938,17355.210938,17063.080078,17140.240234
...,...,...,...,...
1984,11532.070312,11718.280273,11450.889648,11615.929688
1985,11632.809570,11633.780273,11453.339844,11532.959961
1986,11781.700195,11782.349609,11601.519531,11642.469727
1987,11729.669922,11867.110352,11675.530273,11782.349609


### 1.3 Slice row and columns by label<a id='24'></a>
Syntax : `df.loc[starting_row_label : ending_row_label, 'starting_column_name' : 'ending_column_name']`

In [26]:
df.loc[1:3, 'Open' : 'Close']

Unnamed: 0,Open,High,Low,Close
1,17712.759766,17930.609375,17711.800781,17929.990234
2,17456.019531,17704.509766,17456.019531,17694.679688
3,17190.509766,17409.720703,17190.509766,17409.720703


## 2. Slicing rows and columns by position<a id='25'></a>
To slice a Pandas DataFrame by position, use the `iloc` attribute. Positions run from 0 to (number of rows or columns - 1), and the end position of a slice is excluded, as in ordinary Python slicing.

### 2.1 To slice rows by index position<a id='26'></a>
Syntax : `df.iloc[starting_row_index : ending_row_index, :]`

In [32]:
df.iloc[0:2, :]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2016-07-01,17924.240234,18002.380859,17916.910156,17949.369141,82160000,17949.369141
1,2016-06-30,17712.759766,17930.609375,17711.800781,17929.990234,133030000,17929.990234


### 2.2 To slice columns by index position<a id='27'></a>
Syntax : `df.iloc[:, starting_column_index : ending_column_index]`

In [34]:
df.iloc[:, 1:4]

Unnamed: 0,Open,High,Low
0,17924.240234,18002.380859,17916.910156
1,17712.759766,17930.609375,17711.800781
2,17456.019531,17704.509766,17456.019531
3,17190.509766,17409.720703,17190.509766
4,17355.210938,17355.210938,17063.080078
...,...,...,...
1984,11532.070312,11718.280273,11450.889648
1985,11632.809570,11633.780273,11453.339844
1986,11781.700195,11782.349609,11601.519531
1987,11729.669922,11867.110352,11675.530273


### 2.3 To slice row and columns by index position<a id='28'></a>
Syntax 1 : `df.iloc[starting_row_index : ending_row_index, starting_column_index : ending_column_index]`

Syntax 2 : `df.iloc[:ending_row_index, :ending_column_index]` (an omitted start means position 0)

In [45]:
df.iloc[0:2, 0:2]

Unnamed: 0,Date,Open
0,2016-07-01,17924.240234
1,2016-06-30,17712.759766


In [37]:
df.iloc[:2, :2]

Unnamed: 0,Date,Open
0,2016-07-01,17924.240234
1,2016-06-30,17712.759766
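
The outputs above show the key difference between the two kinds of slice: `df.loc[1:3]` returns three rows, while `df.iloc[0:2]` returns two. The sketch below puts the two side by side on a hand-built frame; `sample` is an assumed stand-in for `df`, built from the rows of `df.head()`.

```python
import pandas as pd

# Hypothetical stand-in for df, using values from df.head().
sample = pd.DataFrame({
    "Date": ["2016-07-01", "2016-06-30", "2016-06-29", "2016-06-28", "2016-06-27"],
    "Open": [17924.240234, 17712.759766, 17456.019531, 17190.509766, 17355.210938],
    "Close": [17949.369141, 17929.990234, 17694.679688, 17409.720703, 17140.240234],
})

# Label slice: both endpoints are included (rows 1, 2 and 3).
by_label = sample.loc[1:3, "Open":"Close"]

# Position slice: the stop position is excluded (rows at positions 1 and 2).
by_position = sample.iloc[1:3, 1:3]
```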


# Subsetting by boolean conditions<a id='29'></a>
You can use boolean conditions to obtain a subset of the data from a DataFrame.

## 1. Select rows based on column value<a id='30'></a>

### 1.1 To select all rows whose column contains the specified value(s)<a id='31'></a>
Syntax 1 : `df.column_name == value` (returns only the boolean mask, not the rows)

Syntax 2 : `df.loc[df.column_name == value]`

Syntax 3 : `df[df.column_name == value]`

Syntax 4 : `df[df.column_name.isin([value1, value2, ...])]`

In [47]:
df.Open == 17355.210938

0       False
1       False
2       False
3       False
4        True
        ...  
1984    False
1985    False
1986    False
1987    False
1988    False
Name: Open, Length: 1989, dtype: bool

In [80]:
df[df.Open == 17355.210938]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
4,2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234


In [48]:
df.loc[df.Open == 17355.210938]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
4,2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234


In [49]:
df[df.Open.isin([17355.210938])]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
4,2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234
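
The equality tests above succeed because the literal is exactly the value stored in the column. With floating-point data, a value copied from a rounded display may not match. The sketch below shows one way to compare within a tolerance using `numpy.isclose`; `sample` is an assumed stand-in for `df`, and the tolerance of 0.01 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df, using Open values from df.head().
sample = pd.DataFrame({"Open": [17924.240234, 17712.759766, 17355.210938]})

# Exact match: works because the literal equals the stored value.
exact = sample[sample["Open"] == 17355.210938]

# A rounded literal does not match any row exactly.
rounded = sample[sample["Open"] == 17355.21]

# np.isclose compares within an absolute tolerance instead.
close = sample[np.isclose(sample["Open"], 17355.21, rtol=0, atol=0.01)]
```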


### 1.2 Rows that match multiple column conditions<a id='32'></a>
Syntax : `df[(df.column_name1 == value1) | (df.column_name2 == value2)]` (use `|` for OR and `&` for AND, with each condition in parentheses)

In [51]:
df[(df.Open == 17355.210938) | (df.Close == 17949.369141)]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2016-07-01,17924.240234,18002.380859,17916.910156,17949.369141,82160000,17949.369141
4,2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234
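
When combining conditions, pandas uses the element-wise operators `|` and `&` rather than Python's `or` and `and`. The following sketch shows both operators and what happens with `or`; `sample` is an assumed stand-in for `df`, and the thresholds are chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df, using values from df.head().
sample = pd.DataFrame({
    "Open": [17924.240234, 17712.759766, 17456.019531, 17190.509766, 17355.210938],
    "Close": [17949.369141, 17929.990234, 17694.679688, 17409.720703, 17140.240234],
})

# OR: rows that satisfy at least one condition.
either = sample[(sample["Open"] > 17900) | (sample["Close"] < 17200)]

# AND: rows that satisfy both conditions.
both = sample[(sample["Open"] > 17400) & (sample["Close"] > 17900)]

# Python's `or` asks for a single truth value of a whole Series, which is ambiguous.
or_failed = False
try:
    sample[(sample["Open"] > 17900) or (sample["Close"] < 17200)]
except ValueError:
    or_failed = True
```

The parentheses matter because `|` and `&` bind more tightly than comparison operators such as `>` and `==`.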


### 1.3 Select rows whose column DOES NOT contain specified values<a id='33'></a>
Syntax : `df[~df.column_name.isin([value])]`

In [54]:
# rows 0 and 2 contain these values, so they are excluded from the output
df[~df.Open.isin([17924.240234, 17456.019531])]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
1,2016-06-30,17712.759766,17930.609375,17711.800781,17929.990234,133030000,17929.990234
3,2016-06-28,17190.509766,17409.720703,17190.509766,17409.720703,112190000,17409.720703
4,2016-06-27,17355.210938,17355.210938,17063.080078,17140.240234,138740000,17140.240234
5,2016-06-24,17946.630859,17946.630859,17356.339844,17400.750000,239000000,17400.750000
6,2016-06-23,17844.109375,18011.070312,17844.109375,18011.070312,98070000,18011.070312
...,...,...,...,...,...,...,...
1984,2008-08-14,11532.070312,11718.280273,11450.889648,11615.929688,159790000,11615.929688
1985,2008-08-13,11632.809570,11633.780273,11453.339844,11532.959961,182550000,11532.959961
1986,2008-08-12,11781.700195,11782.349609,11601.519531,11642.469727,173590000,11642.469727
1987,2008-08-11,11729.669922,11867.110352,11675.530273,11782.349609,183190000,11782.349609
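
The `~` operator inverts a boolean mask, turning "is in the list" into "is not in the list". The sketch below builds the mask first and then negates it; `sample` is an assumed stand-in for `df`, built from the rows of `df.head()`.

```python
import pandas as pd

# Hypothetical stand-in for df, using Open values from df.head().
sample = pd.DataFrame({
    "Open": [17924.240234, 17712.759766, 17456.019531, 17190.509766, 17355.210938],
})

excluded = [17924.240234, 17456.019531]
in_list = sample["Open"].isin(excluded)  # True for rows 0 and 2
matched = sample[in_list]                # rows whose Open is in the list
kept = sample[~in_list]                  # all remaining rows
```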


## 2. Select columns based on row value<a id='34'></a>
You can also select the columns in which at least one row contains the specified value.

Syntax : `df.loc[:, df.isin([value]).any()]`

In [57]:
df.loc[:, df.isin([17456.019531]).any()]

Unnamed: 0,Open,Low
0,17924.240234,17916.910156
1,17712.759766,17711.800781
2,17456.019531,17456.019531
3,17190.509766,17190.509766
4,17355.210938,17063.080078
...,...,...
1984,11532.070312,11450.889648
1985,11632.809570,11453.339844
1986,11781.700195,11601.519531
1987,11729.669922,11675.530273
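
The expression above chains three steps: `isin` marks every matching cell, `any()` reduces each column to a single `True` or `False`, and `loc` keeps the columns marked `True`. The sketch below spells the steps out on a hand-built frame; `sample` is an assumed stand-in for `df`, built from the first rows of `df.head()`.

```python
import pandas as pd

# Hypothetical stand-in for df, using values from df.head().
sample = pd.DataFrame({
    "Date": ["2016-07-01", "2016-06-30", "2016-06-29"],
    "Open": [17924.240234, 17712.759766, 17456.019531],
    "Low": [17916.910156, 17711.800781, 17456.019531],
    "Close": [17949.369141, 17929.990234, 17694.679688],
})

hits = sample.isin([17456.019531])  # boolean DataFrame, one flag per cell
col_mask = hits.any()               # boolean Series, one flag per column
selected = sample.loc[:, col_mask]  # keep only the flagged columns
```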


## 3. Subsetting using filter method<a id='35'></a>
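Subsets can also be created with the `filter` method, which selects rows or columns by their labels, not by their values.

Method 1 : `df.filter(items=['column_name1', 'column_name2'])` keeps the listed columns

Method 2 : `df.filter(like='text', axis=0)` keeps the rows whose index label contains `text`

Method 3 : `df.filter(regex='pattern')` keeps the columns whose name matches the regular expression. In the example below, `[^OpenCloseHigh]` is a character class, so it keeps every column whose name has at least one character outside that set of letters; it does not match the column names as whole words

Method 4 : `df[(df['column_name1'] > value1) & (df['column_name2'] > value2)]` is boolean indexing rather than `filter`, and selects rows by their values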

In [58]:
df.filter(items=['Open', 'Close'])

Unnamed: 0,Open,Close
0,17924.240234,17949.369141
1,17712.759766,17929.990234
2,17456.019531,17694.679688
3,17190.509766,17409.720703
4,17355.210938,17140.240234
...,...,...
1984,11532.070312,11615.929688
1985,11632.809570,11532.959961
1986,11781.700195,11642.469727
1987,11729.669922,11782.349609


In [59]:
df.filter(like="2", axis=0)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
2,2016-06-29,17456.019531,17704.509766,17456.019531,17694.679688,106380000,17694.679688
12,2016-06-15,17703.650391,17762.960938,17629.009766,17640.169922,94130000,17640.169922
20,2016-06-03,17799.800781,17833.169922,17689.679688,17807.060547,82270000,17807.060547
21,2016-06-02,17789.050781,17838.560547,17703.550781,17838.560547,75560000,17838.560547
22,2016-06-01,17754.550781,17809.179688,17664.789062,17789.669922,78530000,17789.669922
...,...,...,...,...,...,...,...
1942,2008-10-14,9388.969727,9794.370117,9085.429688,9310.990234,412740000,9310.990234
1952,2008-09-30,10371.580078,10868.900391,10371.419922,10850.660156,319770000,10850.660156
1962,2008-09-16,10905.620117,11093.219727,10742.700195,11059.019531,494760000,11059.019531
1972,2008-09-02,11545.629883,11790.169922,11471.900391,11516.919922,177090000,11516.919922


In [64]:
df.filter(regex="[^OpenCloseHigh]")

Unnamed: 0,Date,Low,Volume,Adj Close
0,2016-07-01,17916.910156,82160000,17949.369141
1,2016-06-30,17711.800781,133030000,17929.990234
2,2016-06-29,17456.019531,106380000,17694.679688
3,2016-06-28,17190.509766,112190000,17409.720703
4,2016-06-27,17063.080078,138740000,17140.240234
...,...,...,...,...
1984,2008-08-14,11450.889648,159790000,11615.929688
1985,2008-08-13,11453.339844,182550000,11532.959961
1986,2008-08-12,11601.519531,173590000,11642.469727
1987,2008-08-11,11675.530273,183190000,11782.349609
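
The pattern `[^OpenCloseHigh]` gives the intended result on this dataset, but only because of which letters the column names happen to contain. The sketch below compares it with two more explicit alternatives; `sample` and `other` are hypothetical frames, and the column name `Hinge` is invented purely to show the pitfall.

```python
import pandas as pd

# Hypothetical stand-in for df: only the column labels matter for filter().
sample = pd.DataFrame(
    columns=["Date", "Open", "High", "Low", "Close", "Volume", "Adj Close"]
)

# Character class: keeps names with at least one character outside the set.
char_class = sample.filter(regex="[^OpenCloseHigh]")

# Negative lookahead: keeps names that do not start with Open, Close or High.
lookahead = sample.filter(regex="^(?!Open|Close|High)")

# Dropping by name says the same thing without a regular expression.
dropped = sample.drop(columns=["Open", "Close", "High"])

# Pitfall: "Hinge" uses only letters from the set, so the character class drops it.
other = pd.DataFrame(columns=["Hinge", "Low"])
surprise = other.filter(regex="[^OpenCloseHigh]")
```

When the goal is simply to exclude known columns, `drop(columns=[...])` is probably the clearest choice.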


In [65]:
df[(df['Open'] > 18281.949219) & (df['Date'] > '2015-05-20')]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
280,2015-05-22,18286.869141,18286.869141,18217.140625,18232.019531,78890000,18232.019531
281,2015-05-21,18285.869141,18314.890625,18249.900391,18285.740234,84270000,18285.740234
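
The date condition above compares strings, which works here because dates written as `YYYY-MM-DD` sort in the same order as the dates they represent. The sketch below shows the string comparison next to a version that converts the column with `pd.to_datetime` first; `sample` is an assumed stand-in for `df`, built from the four rows shown in the conditional selection example.

```python
import pandas as pd

# Hypothetical stand-in for df, using rows 280 to 283 of the dataset.
sample = pd.DataFrame({
    "Date": ["2015-05-22", "2015-05-21", "2015-05-20", "2015-05-19"],
    "Open": [18286.869141, 18285.869141, 18315.060547, 18300.480469],
})

# String comparison: relies on the ISO date format sorting like real dates.
as_text = sample[(sample["Open"] > 18281.949219) & (sample["Date"] > "2015-05-20")]

# Datetime comparison: does not depend on how the dates were written.
sample["Date"] = pd.to_datetime(sample["Date"])
as_dates = sample[
    (sample["Open"] > 18281.949219) & (sample["Date"] > pd.Timestamp("2015-05-20"))
]
```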
