![*INTERTECHNICA - SOLON EDUCATIONAL PROGRAMS - TECHNOLOGY LINE*](https://solon.intertechnica.com/assets/IntertechnicaSolonEducationalPrograms-TechnologyLine.png)

# Data Manipulation with Python - The Pandas Library - Advanced Indexing and Slicing

*Basic initialization of the workspace.*

In [1]:
!python -m pip install numpy
import numpy as np
print ("NumPy installed at version: {}".format(np.__version__))

NumPy installed at version: 1.19.5


In [2]:
!python -m pip install pandas
import pandas as pd
print ("Pandas installed at version: {}".format(pd.__version__))

#adjust pandas DataFrame display for a wider target 
pd.set_option('display.expand_frame_repr', False)

Pandas installed at version: 1.1.5


In [3]:
# load a sample data series
sample_data_series = pd.read_excel(
   "https://raw.githubusercontent.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/main/Module%203%20-%20The%20Pandas%20Library/Session%202%20-%20Pandas%20basics/data/RO_Urban_Air_Polution.xlsx",
   skiprows = 1,
   index_col = 0  
).squeeze()

# load a sample data frame
sample_data_frame = pd.read_excel(
    "https://github.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/raw/main/Module%203%20-%20The%20Pandas%20Library/Session%202%20-%20Pandas%20basics/data/RO_Science_And_Technology.xlsx",
 )


## 1. Series indexing and slicing

Pandas indexing and slicing resemble the NumPy mechanisms but extend them: data can be selected either by index label or by position.

### 1.1 Accessing the series index and values

The Pandas series index and values can be accessed by the **index** and **values** properties:

In [4]:
# display the index and values for the data series
print(
    "The data series \n{}\n has the index \n{}\n and the values \n{}\n".format(
        sample_data_series,
        sample_data_series.index,
        sample_data_series.values
    )
)

The data series 
Year
1990    20.32
1995    18.57
2000    17.70
2005    17.16
2010    17.06
2011    17.64
2012    16.32
2013    15.37
2014    15.01
2015    15.43
2016    14.54
2017    14.61
Name: Air Pollution Mean Exposure, dtype: float64
 has the index 
Int64Index([1990, 1995, 2000, 2005, 2010, 2011, 2012, 2013, 2014, 2015, 2016,
            2017],
           dtype='int64', name='Year')
 and the values 
[20.32 18.57 17.7  17.16 17.06 17.64 16.32 15.37 15.01 15.43 14.54 14.61]



The **values** attribute of the index exposes the index labels as a NumPy array:

In [5]:
print(
    "The values of the index \n{}\n are \n{}\n".format(
        sample_data_series.index,
        sample_data_series.index.values        
    )
)

The values of the index 
Int64Index([1990, 1995, 2000, 2005, 2010, 2011, 2012, 2013, 2014, 2015, 2016,
            2017],
           dtype='int64', name='Year')
 are 
[1990 1995 2000 2005 2010 2011 2012 2013 2014 2015 2016 2017]



When the index of a series has a datetime type, the series is usually called a **time series**. Time series are widely used in industry and are central to temporal data analysis.

In [6]:
# create a timeseries based on existing data
time_series_index = [pd.Timestamp(str(value) + "-01-01") for value in sample_data_series.index]
time_series_values = sample_data_series.values

time_series = pd.Series(
    data = time_series_values,
    index = time_series_index
)

print(
    "The generated timeseries is: \n{}".format(
        time_series
    )
)

The generated timeseries is: 
1990-01-01    20.32
1995-01-01    18.57
2000-01-01    17.70
2005-01-01    17.16
2010-01-01    17.06
2011-01-01    17.64
2012-01-01    16.32
2013-01-01    15.37
2014-01-01    15.01
2015-01-01    15.43
2016-01-01    14.54
2017-01-01    14.61
dtype: float64


For a time series, elements can be selected by timestamp labels:

In [7]:
# select data based on timestamp indexes
timestamps = [pd.Timestamp("1990-01-01"), pd.Timestamp("2017-01-01")]

print(
    "The elements at index {} are \n{}".format(
      timestamps,
      time_series.loc[timestamps]        
    )
)



The elements at index [Timestamp('1990-01-01 00:00:00'), Timestamp('2017-01-01 00:00:00')] are 
1990-01-01    20.32
2017-01-01    14.61
dtype: float64
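
Timestamp labels can also be used for range selection. The following is a minimal sketch, assuming a small inline copy of four values from the series above (the names `years`, `between_timestamps` and `by_year_string` are illustrative only). It shows a label slice between two timestamps and the year-string shorthand that pandas accepts for a datetime index:

```python
import pandas as pd

# a small inline subset of the values shown above (assumed for brevity)
years = [2010, 2011, 2012, 2013]
values = [17.06, 17.64, 16.32, 15.37]

time_series = pd.Series(
    data=values,
    index=pd.to_datetime(["{}-01-01".format(year) for year in years]),
)

# label slice between two timestamps; both endpoints are included
between_timestamps = time_series.loc[
    pd.Timestamp("2011-01-01"):pd.Timestamp("2012-01-01")
]
print(between_timestamps)

# year strings select everything from the start of 2011 to the end of 2012
by_year_string = time_series.loc["2011":"2012"]
print(by_year_string)
```

Both selections return the 2011 and 2012 entries for this data.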


Index labels are also used to remove elements via the [**drop**](https://pandas.pydata.org/docs/reference/api/pandas.Series.drop.html) method. By default it returns a new series without the elements at the specified labels:

In [8]:
# remove series elements by index label
drop_index_values = [1990, 1995]
print(
    "After dropping the elements with the index {}, the new series is \n{}".format(
      drop_index_values,
      sample_data_series.drop(drop_index_values)        
    )
)

After dropping the elements with the index [1990, 1995], the new series is 
Year
2000    17.70
2005    17.16
2010    17.06
2011    17.64
2012    16.32
2013    15.37
2014    15.01
2015    15.43
2016    14.54
2017    14.61
Name: Air Pollution Mean Exposure, dtype: float64
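
Because **drop** returns a new series by default, the original stays untouched. A minimal sketch of that behavior, assuming a three-element inline copy of the data above (the variable names are illustrative):

```python
import pandas as pd

# inline copy of the first three values of the series above
series = pd.Series(
    [20.32, 18.57, 17.70],
    index=pd.Index([1990, 1995, 2000], name="Year"),
)

# drop returns a new series; the original is not modified
trimmed = series.drop([1990, 1995])
print(len(series), len(trimmed))

# a label that is not in the index raises KeyError by default;
# errors="ignore" skips missing labels instead
unchanged = series.drop([1980], errors="ignore")
print(len(unchanged))
```

To keep the result, assign it to a variable as done here with `trimmed`.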


### 1.2 Index and positional access

The access of series data by using index values is enabled by the [**loc**](https://pandas.pydata.org/docs/reference/api/pandas.Series.loc.html) property:   

In [9]:
# select series data by index value
index_values = [2000, 2005 , 2010]
print(
    "The elements in the data series with index {} are \n{}".format(
      index_values,
      sample_data_series.loc[index_values]        
    )
)

The elements in the data series with index [2000, 2005, 2010] are 
Year
2000    17.70
2005    17.16
2010    17.06
Name: Air Pollution Mean Exposure, dtype: float64


In the same manner, to access data series by position, the [**iloc**](https://pandas.pydata.org/docs/reference/api/pandas.Series.iloc.html) property is used:

In [10]:
# select series data by positional value
positional_values = [2, 3, 4]
print(
    "The elements in the data series with positional values {} are \n{}".format(
      positional_values,
      sample_data_series.iloc[positional_values]        
    )
)

The elements in the data series with positional values [2, 3, 4] are 
Year
2000    17.70
2005    17.16
2010    17.06
Name: Air Pollution Mean Exposure, dtype: float64


Note that label-based slicing with **loc** **includes the last element of the interval** (unlike positional slicing in NumPy, where the end of the slice is excluded):

In [11]:
interval_low_limit = 2000
interval_high_limit = 2010

print(
    "The elements in the index interval [{} - {}] are \n{}".format(
      interval_low_limit,
      interval_high_limit,
      sample_data_series.loc[interval_low_limit: interval_high_limit]        
    ))

The elements in the index interval [2000 - 2010] are 
Year
2000    17.70
2005    17.16
2010    17.06
Name: Air Pollution Mean Exposure, dtype: float64
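
The difference between the two slicing conventions can be seen side by side. This is a small sketch on an inline copy of the first five values of the series (names such as `by_label` and `by_position` are illustrative):

```python
import pandas as pd

series = pd.Series(
    [20.32, 18.57, 17.70, 17.16, 17.06],
    index=[1990, 1995, 2000, 2005, 2010],
)

# label-based slice: both 2000 and 2010 are included
by_label = series.loc[2000:2010]
print(list(by_label.index))

# positional slice: position 4 is excluded, as in NumPy
by_position = series.iloc[2:4]
print(list(by_position.index))

# on a sorted index, the slice bounds do not need to be existing labels
between = series.loc[1996:2006]
print(list(between.index))
```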


Both **loc** and **iloc** also accept boolean masks (logical masking).
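
A minimal sketch of logical masking with both accessors, on an inline copy of four values from the series. One assumption to flag: **iloc** is positional, so this sketch passes it a plain boolean array (via `to_numpy()`) rather than an index-aligned boolean series, which **iloc** may reject.

```python
import pandas as pd

series = pd.Series(
    [20.32, 18.57, 17.70, 17.16],
    index=[1990, 1995, 2000, 2005],
)

# boolean series aligned on the index of the original series
mask = series > 18

# loc accepts the boolean series directly
by_loc = series.loc[mask]

# for iloc, convert the mask to a plain NumPy boolean array first
by_iloc = series.iloc[mask.to_numpy()]

print(by_loc)
print(by_loc.equals(by_iloc))
```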

### 1.3 NumPy-like indexing and slicing

Pandas series support the standard slicing operations known from NumPy; in these cases the numeric slice bounds are interpreted as positions:

In [12]:
# getting the shape of the data series
print(
    "The data series shape is {}".format(
      sample_data_series.shape        
    )
)

The data series shape is (12,)


In [13]:
# selecting the first elements
num_elements = 3
print(
    "The first {} elements in the data series are \n{}".format(
      num_elements,
      sample_data_series[0:num_elements]        
    )
)

The first 3 elements in the data series are 
Year
1990    20.32
1995    18.57
2000    17.70
Name: Air Pollution Mean Exposure, dtype: float64


In [14]:
# selecting the last elements
print(
    "The last {} elements in the data series are \n{}".format(
      num_elements,
      sample_data_series[-num_elements:]        
    )
)

The last 3 elements in the data series are 
Year
2015    15.43
2016    14.54
2017    14.61
Name: Air Pollution Mean Exposure, dtype: float64


In [15]:
# selecting elements where value is higher than a limit
value_lower_limit = 18
print(
    "The elements where value is higher than {} are \n{}".format(
      value_lower_limit,
      sample_data_series[sample_data_series > value_lower_limit]        
    )
)

The elements where value is higher than 18 are 
Year
1990    20.32
1995    18.57
Name: Air Pollution Mean Exposure, dtype: float64


In [16]:
# selecting elements where index is higher than a value
index_lower_limit = 2000
print(
    "The elements where index is higher than {} are \n{}".format(
      index_lower_limit,
      sample_data_series[sample_data_series.index > index_lower_limit]        
    )
)

The elements where index is higher than 2000 are 
Year
2005    17.16
2010    17.06
2011    17.64
2012    16.32
2013    15.37
2014    15.01
2015    15.43
2016    14.54
2017    14.61
Name: Air Pollution Mean Exposure, dtype: float64
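
One caveat is worth illustrating: the slices above are positional, but a single integer in plain square brackets is looked up as a label when the index itself is made of integers (as with Year here). The sketch below, on an inline three-element copy of the series, shows why the explicit **iloc** and **loc** accessors are the less ambiguous choice; the flag `plain_lookup_failed` is illustrative only.

```python
import pandas as pd

series = pd.Series([20.32, 18.57, 17.70], index=[1990, 1995, 2000])

# explicit accessors state the intent clearly
first_by_position = series.iloc[0]
first_by_label = series.loc[1990]
print(first_by_position, first_by_label)

# with an integer index, series[0] searches for the label 0, which is absent
try:
    series[0]
    plain_lookup_failed = False
except KeyError:
    plain_lookup_failed = True

print(plain_lookup_failed)
```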


## 2. DataFrames indexing and slicing

Unlike a Pandas series, a Pandas DataFrame has two dimensions for data access: rows and columns. Recall that each DataFrame column is itself a data series.

### 2.1 Accessing the index and values of a DataFrame

The index and values of a data frame are accessed in the same manner as for a series, via the [**index**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.index.html) and [**values**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.values.html) properties:

In [17]:
# access the dataframe's index and the index values
print(
    "The index of the data frame is {} with the index values \n{}".format(
        sample_data_frame.index,
        sample_data_frame.index.values
    )
)

The index of the data frame is RangeIndex(start=0, stop=415, step=1) with the index values 
[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
 216 217 218 219 220 221 222 22

In [18]:
# access the dataframe's values and its shape
print(
    "The values of the data frame are \n{}\n with the shape \n{}".format(
        sample_data_frame.values,
        sample_data_frame.values.shape
    )
)

The values of the data frame are 
[['Romania' 'ROU' 2020
  'Charges for the use of intellectual property, payments (BoP, current US$)'
  'BM.GSR.ROYL.CD' 886842442.49073]
 ['Romania' 'ROU' 2019
  'Charges for the use of intellectual property, payments (BoP, current US$)'
  'BM.GSR.ROYL.CD' 936735170.387543]
 ['Romania' 'ROU' 2018
  'Charges for the use of intellectual property, payments (BoP, current US$)'
  'BM.GSR.ROYL.CD' 962384646.507165]
 ...
 ['Romania' 'ROU' 2009
  'High-technology exports (% of manufactured exports)'
  'TX.VAL.TECH.MF.ZS' 10.367542909696]
 ['Romania' 'ROU' 2008
  'High-technology exports (% of manufactured exports)'
  'TX.VAL.TECH.MF.ZS' 6.92214905365052]
 ['Romania' 'ROU' 2007
  'High-technology exports (% of manufactured exports)'
  'TX.VAL.TECH.MF.ZS' 4.41427750893872]]
 with the shape 
(415, 6)
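
As a side note, the pandas documentation recommends the `to_numpy()` method as an alternative to the **values** property. A minimal sketch on a two-row frame, assuming only the Year and Value columns copied from the output above:

```python
import pandas as pd

# two rows copied from the data frame shown above (Year and Value only)
frame = pd.DataFrame(
    {
        "Year": [2020, 2019],
        "Value": [886842442.49073, 936735170.387543],
    }
)

# to_numpy() returns the same array content as the values property
as_array = frame.to_numpy()
print(as_array.shape)
print((as_array == frame.values).all())
```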


### 2.2 Accessing DataFrame's data by rows and columns 

The data frame's rows can be accessed by index label or by position, as for data series, by using the [**loc**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) and [**iloc**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) properties.

In [19]:
# select a sequence of elements by index value
num_elements = 3
start_index_values = sample_data_frame.index[0: num_elements]
print(
    "The first {} elements in the data frame by index value are \n{}".format(
        num_elements,
        sample_data_frame.loc[start_index_values]
    )
)

The first 3 elements in the data frame by index value are 
  Country Name Country ISO3  Year                                     Indicator Name  Indicator Code         Value
0      Romania          ROU  2020  Charges for the use of intellectual property, ...  BM.GSR.ROYL.CD  8.868424e+08
1      Romania          ROU  2019  Charges for the use of intellectual property, ...  BM.GSR.ROYL.CD  9.367352e+08
2      Romania          ROU  2018  Charges for the use of intellectual property, ...  BM.GSR.ROYL.CD  9.623846e+08


In [20]:
# select a sequence of elements by positional value
print(
    "The last {} elements in the data frame by positional value are \n{}".format(
        num_elements,
        sample_data_frame.iloc[-num_elements:]
    )
)

The last 3 elements in the data frame by positional value are 
    Country Name Country ISO3  Year                                     Indicator Name     Indicator Code      Value
412      Romania          ROU  2009  High-technology exports (% of manufactured exp...  TX.VAL.TECH.MF.ZS  10.367543
413      Romania          ROU  2008  High-technology exports (% of manufactured exp...  TX.VAL.TECH.MF.ZS   6.922149
414      Romania          ROU  2007  High-technology exports (% of manufactured exp...  TX.VAL.TECH.MF.ZS   4.414278


Rows can be removed from a data frame via the [**drop**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method (the same mechanism as for data series). The index labels identify the rows to be removed, and the axis parameter is set to 0 (its default value):

In [21]:
# removing rows from a data frame
rows_to_keep = 10
print(
    "The records from the data frame with all records removed, except the last {} rows are \n{}".format(
        rows_to_keep,
        sample_data_frame.drop(
            sample_data_frame.index.values[0:-rows_to_keep],
            axis = 0,
        )
      )
    )

The records from the data frame with all records removed, except the last 10 rows are 
    Country Name Country ISO3  Year                                     Indicator Name     Indicator Code      Value
405      Romania          ROU  2016  High-technology exports (% of manufactured exp...  TX.VAL.TECH.MF.ZS  10.392441
406      Romania          ROU  2015  High-technology exports (% of manufactured exp...  TX.VAL.TECH.MF.ZS   9.414148
407      Romania          ROU  2014  High-technology exports (% of manufactured exp...  TX.VAL.TECH.MF.ZS   8.409232
408      Romania          ROU  2013  High-technology exports (% of manufactured exp...  TX.VAL.TECH.MF.ZS   7.398176
409      Romania          ROU  2012  High-technology exports (% of manufactured exp...  TX.VAL.TECH.MF.ZS   8.179880
410      Romania          ROU  2011  High-technology exports (% of manufactured exp...  TX.VAL.TECH.MF.ZS  11.635839
411      Romania          ROU  2010  High-technology exports (% of manufactured exp...  TX.VAL

The columns of a data frame can be accessed via the [**columns**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html) property which returns a collection of values containing the column names:

In [22]:
# access the columns of the data frame
print(
    "The columns of the data frame are: \n{}".format(
      sample_data_frame.columns.values       
    )
)

The columns of the data frame are: 
['Country Name' 'Country ISO3' 'Year' 'Indicator Name' 'Indicator Code'
 'Value']


The values of the columns in a data frame are accessed by passing the column name as an index:

In [23]:
# accessing the column data by column name
column_name = 'Indicator Name'

print(
    "The values of the '{}' column are \n{}".format(
        column_name,
        sample_data_frame[column_name]
    )
)

# we can observe in this case that the returned value
# is a data series 

The values of the 'Indicator Name' column are 
0      Charges for the use of intellectual property, ...
1      Charges for the use of intellectual property, ...
2      Charges for the use of intellectual property, ...
3      Charges for the use of intellectual property, ...
4      Charges for the use of intellectual property, ...
                             ...                        
410    High-technology exports (% of manufactured exp...
411    High-technology exports (% of manufactured exp...
412    High-technology exports (% of manufactured exp...
413    High-technology exports (% of manufactured exp...
414    High-technology exports (% of manufactured exp...
Name: Indicator Name, Length: 415, dtype: object


Multiple columns may be specified in order to select a subset of a data frame:

In [24]:
# accessing the column data by multiple column names
column_names = ['Indicator Name', 'Value']

print(
    "The values of the columns {} are \n{}".format(
        column_names,
        sample_data_frame[column_names]
    )
)

The values of the columns ['Indicator Name', 'Value'] are 
                                        Indicator Name         Value
0    Charges for the use of intellectual property, ...  8.868424e+08
1    Charges for the use of intellectual property, ...  9.367352e+08
2    Charges for the use of intellectual property, ...  9.623846e+08
3    Charges for the use of intellectual property, ...  9.110375e+08
4    Charges for the use of intellectual property, ...  8.320201e+08
..                                                 ...           ...
410  High-technology exports (% of manufactured exp...  1.163584e+01
411  High-technology exports (% of manufactured exp...  1.252824e+01
412  High-technology exports (% of manufactured exp...  1.036754e+01
413  High-technology exports (% of manufactured exp...  6.922149e+00
414  High-technology exports (% of manufactured exp...  4.414278e+00

[415 rows x 2 columns]
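
The type of the result depends on how the column is requested: a single name returns a data series, while a list of names returns a data frame, even when the list holds one name. A minimal sketch, assuming a two-row inline copy of the frame above (`as_series` and `as_frame` are illustrative names):

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "Year": [2020, 2019],
        "Value": [886842442.49073, 936735170.387543],
    }
)

# a single column name returns a Series
as_series = frame["Value"]

# a list of column names returns a DataFrame, even with one name
as_frame = frame[["Value"]]

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```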


Columns can be removed from a data frame with the same drop method, either by specifying the axis = 1 parameter or by passing the column names to the columns parameter:

In [25]:
# drop the data by columns by name
# using the axis parameter
column_names_for_removal = ["Country Name", "Country ISO3", "Indicator Code"]

print(
    "The data in the data frame after removing the columns {} is \n{}".format(
        column_names_for_removal,
        sample_data_frame.drop(
            column_names_for_removal,
            axis = 1
        )        
    )
  )

The data in the data frame after removing the columns ['Country Name', 'Country ISO3', 'Indicator Code'] is 
     Year                                     Indicator Name         Value
0    2020  Charges for the use of intellectual property, ...  8.868424e+08
1    2019  Charges for the use of intellectual property, ...  9.367352e+08
2    2018  Charges for the use of intellectual property, ...  9.623846e+08
3    2017  Charges for the use of intellectual property, ...  9.110375e+08
4    2016  Charges for the use of intellectual property, ...  8.320201e+08
..    ...                                                ...           ...
410  2011  High-technology exports (% of manufactured exp...  1.163584e+01
411  2010  High-technology exports (% of manufactured exp...  1.252824e+01
412  2009  High-technology exports (% of manufactured exp...  1.036754e+01
413  2008  High-technology exports (% of manufactured exp...  6.922149e+00
414  2007  High-technology exports (% of manufactured exp...  4.41

In [26]:
# dropping the columns directly via the columns parameter
print(
    "The data in the data frame after removing the columns {} is \n{}".format(
        column_names_for_removal,
        sample_data_frame.drop(
            columns = column_names_for_removal
        )        
    )
  )

The data in the data frame after removing the columns ['Country Name', 'Country ISO3', 'Indicator Code'] is 
     Year                                     Indicator Name         Value
0    2020  Charges for the use of intellectual property, ...  8.868424e+08
1    2019  Charges for the use of intellectual property, ...  9.367352e+08
2    2018  Charges for the use of intellectual property, ...  9.623846e+08
3    2017  Charges for the use of intellectual property, ...  9.110375e+08
4    2016  Charges for the use of intellectual property, ...  8.320201e+08
..    ...                                                ...           ...
410  2011  High-technology exports (% of manufactured exp...  1.163584e+01
411  2010  High-technology exports (% of manufactured exp...  1.252824e+01
412  2009  High-technology exports (% of manufactured exp...  1.036754e+01
413  2008  High-technology exports (% of manufactured exp...  6.922149e+00
414  2007  High-technology exports (% of manufactured exp...  4.41

The data in a data frame can be accessed by both index and column values in order to obtain data subsets:

In [27]:
# selecting data by column names and index ranges
column_names = ["Indicator Name", "Value"]
count_values = 10

print(
    "The data in the data frame with columns {} and the first {} index values are: \n{}".format(
        column_names,
        count_values,
        sample_data_frame[column_names][0:count_values]
    )
)

The data in the data frame with columns ['Indicator Name', 'Value'] and the first 10 index values are: 
                                      Indicator Name         Value
0  Charges for the use of intellectual property, ...  8.868424e+08
1  Charges for the use of intellectual property, ...  9.367352e+08
2  Charges for the use of intellectual property, ...  9.623846e+08
3  Charges for the use of intellectual property, ...  9.110375e+08
4  Charges for the use of intellectual property, ...  8.320201e+08
5  Charges for the use of intellectual property, ...  8.344480e+08
6  Charges for the use of intellectual property, ...  9.012962e+08
7  Charges for the use of intellectual property, ...  8.798034e+08
8  Charges for the use of intellectual property, ...  4.525252e+08
9  Charges for the use of intellectual property, ...  4.824162e+08
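
The same kind of subset can also be requested in a single step, since **loc** and **iloc** accept a row selector and a column selector together. The following sketch assumes a three-row inline copy of the frame; note that the **loc** row slice includes its end label while the **iloc** slice excludes its end position.

```python
import pandas as pd

indicator = (
    "Charges for the use of intellectual property, "
    "payments (BoP, current US$)"
)
frame = pd.DataFrame(
    {
        "Year": [2020, 2019, 2018],
        "Indicator Name": [indicator, indicator, indicator],
        "Value": [886842442.49073, 936735170.387543, 962384646.507165],
    }
)

# rows by label (0 to 1, inclusive) and columns by name
subset_by_label = frame.loc[0:1, ["Indicator Name", "Value"]]

# rows by position (0 and 1) and columns by position (1 and 2)
subset_by_position = frame.iloc[0:2, [1, 2]]

print(subset_by_label.shape)
print(subset_by_label.equals(subset_by_position))
```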


The data frame values can be accessed as well by using logical masking (boolean indexes):

In [28]:
# using logical masking for selecting data frame records
print(
        "The records related to high-technology exports (current US$), after 1990, are: \n{}".format(
            sample_data_frame[
              (sample_data_frame["Year"] > 1990) &
              (sample_data_frame["Indicator Name"] == "High-technology exports (current US$)")
            ]
        )
  )

The records related to high-technology exports (current US$), after 1990, are: 
    Country Name Country ISO3  Year                         Indicator Name  Indicator Code         Value
387      Romania          ROU  2020  High-technology exports (current US$)  TX.VAL.TECH.CD  6.984613e+09
388      Romania          ROU  2019  High-technology exports (current US$)  TX.VAL.TECH.CD  6.994469e+09
389      Romania          ROU  2018  High-technology exports (current US$)  TX.VAL.TECH.CD  6.636988e+09
390      Romania          ROU  2017  High-technology exports (current US$)  TX.VAL.TECH.CD  5.558595e+09
391      Romania          ROU  2016  High-technology exports (current US$)  TX.VAL.TECH.CD  5.254484e+09
392      Romania          ROU  2015  High-technology exports (current US$)  TX.VAL.TECH.CD  4.436063e+09
393      Romania          ROU  2014  High-technology exports (current US$)  TX.VAL.TECH.CD  4.471137e+09
394      Romania          ROU  2013  High-technology exports (current US$)  TX.VAL.TECH.CD  3