## Selection
- Covers the main ways to select data from a pandas data frame

### Difference between interactive and production work

Note: while standard Python/NumPy expressions for selecting and setting  
    are intuitive and come in handy for interactive work, for  
    production code we recommend the optimized pandas data access  
    methods: .at, .iat, .loc, and .iloc.
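As a rough illustration of the access methods the note refers to, the sketch below reads the same cell through `.loc`, `.iloc`, `.at`, and `.iat`. The frame `df` and its labels `x`, `y`, `z` are made up for this example and are not part of the notebook.

```python
import pandas as pd

# Hypothetical three-row frame, used only for this illustration
df = pd.DataFrame({"A": [0, 4, 8], "B": [1, 5, 9]}, index=["x", "y", "z"])

by_label = df.loc["y", "B"]      # label-based selection
by_position = df.iloc[1, 1]      # position-based selection
scalar_label = df.at["y", "B"]   # single-value access by label
scalar_pos = df.iat[1, 1]        # single-value access by position

print(by_label, by_position, scalar_label, scalar_pos)
```

All four expressions read the same cell; `.at` and `.iat` accept only a single row and a single column, while `.loc` and `.iloc` also accept slices and lists.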

In [1]:
import pandas as pd
import numpy as np

In [2]:
# create a sample NumPy array with 6 rows and 4 columns
sample_numpy_data = np.arange(24).reshape((6, 4))
sample_numpy_data

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [3]:
#create a date_range using pandas
dates_index=pd.date_range('20160101',periods=6)
dates_index

DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')

#### DataFrame method - Creation
- pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

In [4]:
# create a data frame with the dates as the row index and A-D as the column labels
sample_df=pd.DataFrame(sample_numpy_data,index=dates_index,columns=list('ABCD'))
sample_df

Unnamed: 0,A,B,C,D
2016-01-01,0,1,2,3
2016-01-02,4,5,6,7
2016-01-03,8,9,10,11
2016-01-04,12,13,14,15
2016-01-05,16,17,18,19
2016-01-06,20,21,22,23


### DataFrame method - selection with Column name

In [5]:
sample_df['C'] # returns column C as a Series,
# with one value per row, indexed by the date index

2016-01-01     2
2016-01-02     6
2016-01-03    10
2016-01-04    14
2016-01-05    18
2016-01-06    22
Freq: D, Name: C, dtype: int64

#### selection with slice
 - Remember: a slice runs <b>up to but not including</b> the end position
 - A plain integer slice on a data frame works on the row index (here the dates index), so it returns whole rows

In [6]:
sample_df[1:4] # positions 1, 2 and 3; position 4 is excluded

Unnamed: 0,A,B,C,D
2016-01-02,4,5,6,7
2016-01-03,8,9,10,11
2016-01-04,12,13,14,15


#### selection with date time index
 - <b>Note: </b>unlike a positional slice, a slice on index labels includes the last label

In [7]:
sample_df['2016-01-01':'2016-01-04']

Unnamed: 0,A,B,C,D
2016-01-01,0,1,2,3
2016-01-02,4,5,6,7
2016-01-03,8,9,10,11
2016-01-04,12,13,14,15
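The difference between the two slice styles can be checked side by side. The sketch below rebuilds the sample frame under the shorter, illustrative names `dates` and `df` so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame (names shortened for this sketch)
dates = pd.date_range("20160101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

positional = df[1:4]                       # positions 1, 2, 3; position 4 excluded
by_label = df["2016-01-02":"2016-01-04"]   # labels 02, 03, 04; end label included

print(len(positional), len(by_label))
print(positional.equals(by_label))
```

Both selections return the same three rows, even though the positional slice stops before its end and the label slice includes it.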


## Selection by label
- documentation - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html
- <b>DataFrame.loc</b> is a label-based indexer used for selection by label.

In [8]:
# above, we selected rows by giving the actual index values
# but what if you want to pick rows without typing the values?
# we can do that by passing labels taken from the index to loc
sample_df.loc[dates_index[1:3]]

Unnamed: 0,A,B,C,D
2016-01-02,4,5,6,7
2016-01-03,8,9,10,11


##### selection using multi-axis by label

In [9]:
# return all rows for columns A and B
sample_df.loc[:,['A','B']]

Unnamed: 0,A,B
2016-01-01,0,1
2016-01-02,4,5
2016-01-03,8,9
2016-01-04,12,13
2016-01-05,16,17
2016-01-06,20,21


#### Label slicing: unlike integer slicing, both endpoints are included

In [10]:
# the previous cell returned all rows; now let's slice the rows by label
sample_df.loc['2016-01-01':'2016-01-03',['A','D']]

Unnamed: 0,A,D
2016-01-01,0,3
2016-01-02,4,7
2016-01-03,8,11


##### Reduce number of dimensions for returned object
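- <b>Notice</b> the order of 'D' and 'B': the result follows the order requested, and selecting a single row label returns a Series instead of a data frame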

In [11]:
sample_df.loc['2016-01-03',['D','B']] # ONE ROW TWO COLUMNS

D    11
B     9
Name: 2016-01-03 00:00:00, dtype: int64
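Whether the result loses a dimension depends on how the row is requested. In the sketch below (sample frame rebuilt under the illustrative names `dates` and `df`), a scalar row label gives a Series, while a one-element list of labels keeps a data frame.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame (names shortened for this sketch)
dates = pd.date_range("20160101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

row_as_series = df.loc[dates[2], ["D", "B"]]    # scalar row label -> Series
row_as_frame = df.loc[[dates[2]], ["D", "B"]]   # list of row labels -> DataFrame

print(type(row_as_series).__name__, row_as_series.shape)
print(type(row_as_frame).__name__, row_as_frame.shape)
```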

##### using result
- To use the result in an arithmetic operation, extract the bare value so that the column names are not carried along

In [12]:
sample_df.loc['2016-01-03',['D','B']].iloc[0]*4  # first selected value, taken by position

44
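There are several ways to get bare values out of the selected Series. The sketch below shows three of them on a rebuilt copy of the sample frame; the names `df`, `selected`, and the result variables are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame (names shortened for this sketch)
dates = pd.date_range("20160101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

selected = df.loc["2016-01-03", ["D", "B"]]   # Series labelled 'D' and 'B'

first_by_position = selected.iloc[0] * 4      # first element, by position
first_by_label = selected["D"] * 4            # same element, by label
bare_values = selected.to_numpy() * 4         # NumPy array, labels dropped

print(first_by_position, first_by_label, bare_values)
```

Using `.iloc[0]` or the label makes the intent explicit, which a bare integer key on a label-indexed Series does not.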

##### Select a scalar
- For example, select the 'C' value in the third row

In [13]:
sample_df.loc[dates_index[2],'C']

10

## Selection by Position
- documentation - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html
- integer-location based indexing for selection by position

In [14]:
# plain NumPy indexing, for comparison
sample_numpy_data

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [15]:
sample_numpy_data[3]

array([12, 13, 14, 15])

#### iloc
- Purely integer-location based indexing for selection by position.

In [16]:
sample_df.iloc[3]

A    12
B    13
C    14
D    15
Name: 2016-01-04 00:00:00, dtype: int64

##### integer slices

In [17]:
sample_df.iloc[1:3,2:4]

Unnamed: 0,C,D
2016-01-02,6,7
2016-01-03,10,11
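The same block of cells can be reached by position or by label. The sketch below, which rebuilds the sample frame under the illustrative names `dates` and `df`, compares the two; the label version needs its end labels one step earlier because label slices include the endpoint.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame (names shortened for this sketch)
dates = pd.date_range("20160101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_position = df.iloc[1:3, 2:4]                  # end positions excluded
by_label = df.loc[dates[1]:dates[2], "C":"D"]    # end labels included

print(by_position.equals(by_label))
```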


##### list of integers
- selects only the listed row positions and column positions

In [18]:
sample_df.iloc[[0,1,3],[0,2]] #rows and columns

Unnamed: 0,A,C
2016-01-01,0,2
2016-01-02,4,6
2016-01-04,12,14


##### slicing rows explicitly
##### selecting all columns implicitly

In [19]:
sample_df.iloc[1:3,:]

Unnamed: 0,A,B,C,D
2016-01-02,4,5,6,7
2016-01-03,8,9,10,11


##### slicing columns explicitly
##### selecting all rows implicitly

In [20]:
sample_df.iloc[:,1:3]

Unnamed: 0,B,C
2016-01-01,1,2
2016-01-02,5,6
2016-01-03,9,10
2016-01-04,13,14
2016-01-05,17,18
2016-01-06,21,22



### Boolean indexing
##### test based on one column name

In [21]:
sample_df.C>=14

2016-01-01    False
2016-01-02    False
2016-01-03    False
2016-01-04     True
2016-01-05     True
2016-01-06     True
Freq: D, Name: C, dtype: bool
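A boolean Series like this one is usually passed back into the data frame to keep only the rows marked True. The sketch below rebuilds the sample frame under the illustrative names `dates` and `df`, filters on one condition, and then combines two conditions with `&`; each condition needs its own parentheses.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame (names shortened for this sketch)
dates = pd.date_range("20160101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

mask = df["C"] >= 14          # boolean Series, one entry per row
rows = df[mask]               # keeps only the rows where the mask is True

# Two conditions combined element-wise with & (use | for "or")
combined = df[(df["C"] >= 14) & (df["A"] < 20)]

print(rows)
print(combined)
```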

##### test based on entire data set

In [22]:
# instead of testing one column,
# we can apply the test to the entire data frame
sample_df

Unnamed: 0,A,B,C,D
2016-01-01,0,1,2,3
2016-01-02,4,5,6,7
2016-01-03,8,9,10,11
2016-01-04,12,13,14,15
2016-01-05,16,17,18,19
2016-01-06,20,21,22,23


In [23]:
sample_df[sample_df >=11]
# NaN below marks every value in the data frame that was less than 11;
# NaN is a float, so the remaining values are shown as floats

Unnamed: 0,A,B,C,D
2016-01-01,,,,
2016-01-02,,,,
2016-01-03,,,,11.0
2016-01-04,12.0,13.0,14.0,15.0
2016-01-05,16.0,17.0,18.0,19.0
2016-01-06,20.0,21.0,22.0,23.0
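Indexing with a boolean data frame behaves like `DataFrame.where`. The sketch below, on a rebuilt copy of the sample frame with the illustrative names `df`, `cond`, `masked`, `same`, and `filled`, compares the two and shows the `other` argument as one way to substitute a value instead of NaN.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame (names shortened for this sketch)
dates = pd.date_range("20160101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

cond = df >= 11                    # boolean data frame, same shape as df

masked = df[cond]                  # values failing the test become NaN
same = df.where(cond)              # equivalent spelling of the line above
filled = df.where(cond, other=0)   # use 0 instead of NaN for failing values

print(masked.equals(same))
print(int(masked.isna().sum().sum()))   # number of values below 11
print(filled)
```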


### isin() method
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html
- Returns boolean DataFrame showing whether each element in the DataFrame is contained in values.    

In [24]:
sample_df_2=sample_df.copy()
sample_df_2

Unnamed: 0,A,B,C,D
2016-01-01,0,1,2,3
2016-01-02,4,5,6,7
2016-01-03,8,9,10,11
2016-01-04,12,13,14,15
2016-01-05,16,17,18,19
2016-01-06,20,21,22,23


In [25]:
# let's add a new column to the data frame
sample_df_2['Fruits']=['apple','orange','bananas','strawberry','blueberry','pineapple']
sample_df_2

Unnamed: 0,A,B,C,D,Fruits
2016-01-01,0,1,2,3,apple
2016-01-02,4,5,6,7,orange
2016-01-03,8,9,10,11,bananas
2016-01-04,12,13,14,15,strawberry
2016-01-05,16,17,18,19,blueberry
2016-01-06,20,21,22,23,pineapple


##### select rows where 'Fruits' column contains either 'bananas' or 'pineapple'
- notice: 'smoothly' is not in the df, so it simply matches nothing

In [26]:
sample_df_2[sample_df_2['Fruits'].isin(['bananas','pineapple','smoothly'])]

Unnamed: 0,A,B,C,D,Fruits
2016-01-03,8,9,10,11,bananas
2016-01-06,20,21,22,23,pineapple
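The mask returned by `isin()` can also be inverted with `~` to select the rows that are not in the list. The sketch below rebuilds the frame with the 'Fruits' column under the illustrative names `df`, `wanted`, `matches`, and `others`.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame with the extra 'Fruits' column
dates = pd.date_range("20160101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))
df["Fruits"] = ["apple", "orange", "bananas", "strawberry", "blueberry", "pineapple"]

wanted = ["bananas", "pineapple", "smoothly"]   # 'smoothly' matches no row
mask = df["Fruits"].isin(wanted)

matches = df[mask]     # rows whose fruit is in the list
others = df[~mask]     # rows whose fruit is not in the list

print(matches["Fruits"].tolist())
print(others["Fruits"].tolist())
```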
