# Indexing and Selecting Data

In this section, you will:

* Select rows from a dataframe
* Select columns from a dataframe
* Select subsets of dataframes

### Selecting Rows

Selecting rows in dataframes is similar to the indexing you have seen in NumPy arrays. The syntax ```df[start_index:end_index]``` returns the rows from position ```start_index``` up to, but not including, position ```end_index```.

In [36]:
import numpy as np
import pandas as pd

market_df = pd.read_csv("global_sales_data/market_fact.csv")
market_df.head()

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56
1,Ord_5406,Prod_13,SHP_7549,Cust_1818,42.27,0.01,13,4.56,0.93,0.54
2,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59
3,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.89,0.09,43,729.34,14.3,0.37
4,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.15,0.08,35,1219.87,26.3,0.38


Notice that, by default, pandas assigns integer labels to the rows, starting at 0.

In [11]:
# Selecting the rows from indices 2 to 6
market_df[2:7]

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
2,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59
3,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.89,0.09,43,729.34,14.3,0.37
4,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.15,0.08,35,1219.87,26.3,0.38
5,Ord_5446,Prod_6,SHP_7608,Cust_1818,164.02,0.03,23,-47.64,6.15,0.37
6,Ord_31,Prod_12,SHP_41,Cust_26,14.76,0.01,5,1.32,0.5,0.36


In [12]:
# Selecting alternate rows starting from index = 5
market_df[5::2].head()

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
5,Ord_5446,Prod_6,SHP_7608,Cust_1818,164.02,0.03,23,-47.64,6.15,0.37
7,Ord_4725,Prod_4,SHP_6593,Cust_1641,3410.1575,0.1,48,1137.91,0.99,0.55
9,Ord_4725,Prod_6,SHP_6593,Cust_1641,57.22,0.07,8,-27.72,6.6,0.37
11,Ord_1925,Prod_6,SHP_2637,Cust_708,465.9,0.05,38,79.34,4.86,0.38
13,Ord_2207,Prod_11,SHP_3093,Cust_839,3364.248,0.1,15,-693.23,61.76,0.78
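
The slicing rules above can be checked on a small dataframe. This is a minimal sketch: ```toy_df``` and its string index labels are made up for illustration and are not part of the sales dataset. It shows that integer slices inside ```[]``` work by position, with the start included and the stop excluded.

```python
import pandas as pd

# Toy data (hypothetical), indexed by string labels instead of 0, 1, 2, ...
toy_df = pd.DataFrame(
    {"Sales": [136.81, 42.27, 4701.69, 2337.89, 4233.15]},
    index=["a", "b", "c", "d", "e"],
)

# Integer slices inside [] select rows by position
middle = toy_df[1:3]        # rows at positions 1 and 2
every_other = toy_df[::2]   # rows at positions 0, 2 and 4

print(middle.index.tolist())       # ['b', 'c']
print(every_other.index.tolist())  # ['a', 'c', 'e']
```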


### Selecting Columns

There are two simple ways to select a single column from a dataframe - ```df['column_name']``` and ```df.column_name```.

In [13]:
# Using df['column']
sales = market_df['Sales']
sales.head()


0     136.81
1      42.27
2    4701.69
3    2337.89
4    4233.15
Name: Sales, dtype: float64

In [14]:
# Using df.column
sales = market_df.Sales
sales.head()

0     136.81
1      42.27
2    4701.69
3    2337.89
4    4233.15
Name: Sales, dtype: float64

In [15]:
# Notice that in both cases, the result is a Series object
print(type(market_df['Sales']))
print(type(market_df.Sales))


<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
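
The two notations are not always interchangeable. The sketch below uses a made-up ```toy_df``` whose column names ```Order Quantity``` and ```count``` are hypothetical, chosen to show two cases where ```df.column_name``` does not reach the column: a name containing a space, and a name that collides with an existing DataFrame method. ```df['column_name']``` works in both cases.

```python
import pandas as pd

# Toy data; "Order Quantity" and "count" are hypothetical column names
toy_df = pd.DataFrame({
    "Sales": [136.81, 42.27],
    "Order Quantity": [23, 13],
    "count": [1, 2],
})

# Bracket notation works for any column name
qty = toy_df["Order Quantity"]
count_col = toy_df["count"]

# Attribute access finds the DataFrame.count method, not the column
count_attr = toy_df.count

print(type(count_col))       # <class 'pandas.core.series.Series'>
print(callable(count_attr))  # True
```

A name with a space, such as ```Order Quantity```, cannot be written as an attribute at all, so bracket notation is the safer default.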


#### Selecting Multiple Columns 

You can select multiple columns by passing a list of column names inside the ```[]```: ```df[['column_1', 'column_2', 'column_n']]```.

For instance, to select only the columns Cust_id, Sales and Profit:

In [16]:
# Select Cust_id, Sales and Profit:
market_df[['Cust_id', 'Sales', 'Profit']].head()

Unnamed: 0,Cust_id,Sales,Profit
0,Cust_1818,136.81,-30.51
1,Cust_1818,42.27,4.56
2,Cust_1818,4701.69,1148.9
3,Cust_1818,2337.89,729.34
4,Cust_1818,4233.15,1219.87


Notice that in this case, the output is itself a dataframe.

In [17]:
type(market_df[['Cust_id', 'Sales', 'Profit']])

pandas.core.frame.DataFrame

In [18]:
# Similarly, if you select one column using double square brackets, 
# you'll get a df, not Series

type(market_df[['Sales']])

pandas.core.frame.DataFrame
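
The difference between the two return types also shows up in the shape of the result. A minimal sketch on a made-up ```toy_df``` (values borrowed from the first rows above):

```python
import pandas as pd

# Toy data (hypothetical dataframe, values taken from the rows shown earlier)
toy_df = pd.DataFrame({
    "Sales": [136.81, 42.27, 4701.69],
    "Profit": [-30.51, 4.56, 1148.9],
})

as_series = toy_df["Sales"]    # single brackets: one-dimensional Series
as_frame = toy_df[["Sales"]]   # double brackets: two-dimensional DataFrame

print(as_series.shape)  # (3,)
print(as_frame.shape)   # (3, 1)
```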

### Selecting Subsets of Dataframes

So far, you have selected rows and columns in the following ways:
* Selecting rows: ```df[start:stop]```
* Selecting columns: ```df['column']``` or ```df.column``` or ```df[['col_x', 'col_y']]```
    * ```df['column']``` or ```df.column``` return a series
    * ```df[['col_x', 'col_y']]``` returns a dataframe

However, this style of indexing is ambiguous, so it is not the recommended way to subset dataframes. For instance, let's try to select the third row of the dataframe.



In [19]:
# Trying to select the third row: Throws an error
market_df[2]

KeyError: 2

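Pandas raises a ```KeyError``` because a single value inside ```[]``` is looked up among the column labels, and there is no column named ```2```. Even if ```[2]``` did refer to rows, it would be unclear whether it meant a *position* or a *label*. Recall from the previous section that you can change the row indices.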
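
The ```KeyError``` can be reproduced on a small dataframe. In this sketch, ```toy_df``` and its integer column label ```2``` are hypothetical, added only to show that a single value inside ```[]``` is matched against column labels rather than rows.

```python
import pandas as pd

# Toy data; the integer column label 2 is hypothetical
toy_df = pd.DataFrame({"Sales": [136.81, 42.27, 4701.69], 2: [10, 20, 30]})

# A single value inside [] is looked up among the column labels
col_two = toy_df[2]

# A value that is not a column label raises KeyError, even though row 0 exists
try:
    toy_df[0]
    raised = False
except KeyError:
    raised = True

print(col_two.tolist())  # [10, 20, 30]
print(raised)            # True
```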

In [20]:
# Changing the row indices to Ord_id
market_df.set_index('Ord_id').head()

Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,23,-30.51,3.6,0.56
Ord_5406,Prod_13,SHP_7549,Cust_1818,42.27,0.01,13,4.56,0.93,0.54
Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,26,1148.9,2.5,0.59
Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.89,0.09,43,729.34,14.3,0.37
Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.15,0.08,35,1219.87,26.3,0.38


Now imagine you had a column with entries ```[2, 4, 7, 8 ...]```, and you set that as the index. What should ```df[2]``` return: the third row (position 2), or the row with the index label 2?

Taking an example from this dataset, say you decide to assign the ```Order_Quantity``` column as the index.

In [24]:
market_df.set_index('Order_Quantity').head()

Order_Quantity,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Profit,Shipping_Cost,Product_Base_Margin
23,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.81,0.01,-30.51,3.6,0.56
13,Ord_5406,Prod_13,SHP_7549,Cust_1818,42.27,0.01,4.56,0.93,0.54
26,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.69,0.0,1148.9,2.5,0.59
43,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.89,0.09,729.34,14.3,0.37
35,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.15,0.08,1219.87,26.3,0.38


Now, what should ```df[13]``` return - the 14th row, or the row with index label 13 (i.e. the second row)?

Because of this and other similar ambiguities, pandas provides **explicit ways** to subset dataframes: position-based indexing (```iloc```) and label-based indexing (```loc```), which we'll study next.
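
As a preview, here is a minimal sketch of how the two explicit indexers resolve the question above. ```toy_df``` is a hypothetical five-row dataframe built from the ```Order_Quantity``` and ```Sales``` values shown earlier; ```loc``` and ```iloc``` are the standard pandas label-based and position-based indexers.

```python
import pandas as pd

# Toy data built from the first five rows shown above
toy_df = pd.DataFrame({
    "Order_Quantity": [23, 13, 26, 43, 35],
    "Sales": [136.81, 42.27, 4701.69, 2337.89, 4233.15],
}).set_index("Order_Quantity")

by_label = toy_df.loc[13]      # the row whose index label is 13
by_position = toy_df.iloc[1]   # the row at position 1 (the second row)

print(by_label["Sales"])     # 42.27
print(by_position["Sales"])  # 42.27
```

Here ```toy_df.iloc[13]``` would raise an ```IndexError```, since the dataframe has only five rows; there is no guessing between the two meanings.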

In [26]:
market_df[13:14]

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
13,Ord_2207,Prod_11,SHP_3093,Cust_839,3364.248,0.1,15,-693.23,61.76,0.78


In [29]:
market_df.loc[13:13]

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
13,Ord_2207,Prod_11,SHP_3093,Cust_839,3364.248,0.1,15,-693.23,61.76,0.78


In [31]:
market_df.iloc[13:14]

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
13,Ord_2207,Prod_11,SHP_3093,Cust_839,3364.248,0.1,15,-693.23,61.76,0.78
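
The three cells above return the same row, but note the different stop values: ```loc[13:13]``` versus ```iloc[13:14]```. Label-based slices include the stop label, while position-based slices exclude the stop position. A minimal sketch on a made-up ```toy_df``` with the default integer index:

```python
import pandas as pd

# Toy data (hypothetical dataframe) with the default 0..4 integer index
toy_df = pd.DataFrame({"Sales": [136.81, 42.27, 4701.69, 2337.89, 4233.15]})

label_slice = toy_df.loc[1:3]       # labels 1, 2 and 3: stop label included
position_slice = toy_df.iloc[1:3]   # positions 1 and 2: stop position excluded

print(len(label_slice))     # 3
print(len(position_slice))  # 2
```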


In [32]:
# Set Ord_id as the row index, modifying market_df in place
market_df.set_index('Ord_id', inplace=True)

In [35]:
market_df.iloc[13:14]

Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
Ord_2207,Prod_11,SHP_3093,Cust_839,3364.248,0.1,15,-693.23,61.76,0.78
