## Pandas Indexing and Subsetting

This notebook focuses on understanding indexing and subsetting in pandas DataFrames.

1. How to define a DataFrame from a dictionary.
2. How to set the index and access a DataFrame's columns and underlying values.
3. How to select specific rows and columns, and slices of them.

In [2]:
import numpy as np
import pandas as pd
import random
import string

In [65]:
##create a dictionary that stores each student's id, roll number,
##and math, physics, and chemistry scores.

scores_dict = {
    'id': [''.join(random.choices(
        string.ascii_uppercase + string.digits, k=5)
                 ) for _ in range(30)],
    'roll': np.arange(30) + 1,
    'math_scores': np.random.randint(100, size=(30)),
    'physics_scores': np.random.randint(100, size=(30)),
    'chemistry_scores': np.random.randint(100, size=(30))
}

print(scores_dict)

{'id': ['37BGI', 'UX8DX', 'LD14M', 'NGL3G', 'VZ3ZI', '7SC9C', '1IGQM', 'UTUUP', 'VRU88', 'NIOYE', 'XGZGC', 'O51X8', '4Q7UB', 'Z5TVI', 'CKDS0', 'D6OHM', '5PXSQ', '9VOL3', 'ZGJFF', 'WRKQN', 'E3FK5', 'NIKDD', '5RQHX', 'RCU65', '20X3L', 'IPJ66', 'Z3COV', '7O66I', 'LACYN', 'OEOHY'], 'roll': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]), 'math_scores': array([85,  8, 22, 91, 64, 25, 24, 97,  7, 54, 97, 90, 52, 47, 33, 80, 12,
        7, 45,  3, 91, 64, 98, 90, 11, 97, 84, 89, 63, 41]), 'physics_scores': array([75, 56, 15, 99, 40, 84, 91, 67, 44, 13, 74, 98, 19, 33,  1, 81, 94,
       10, 19, 34, 97,  7, 91, 20, 93, 44, 94, 53, 13, 58]), 'chemistry_scores': array([ 6, 49, 43, 92, 56, 58, 26, 67,  8, 55, 62, 13, 68, 63, 42, 30, 40,
        4, 69, 88, 26, 93, 90, 76, 27, 93,  3, 26, 14, 41])}


In [66]:
##convert the scores_dict to a pandas dataframe

df = pd.DataFrame(scores_dict)
df.head(10)

,id,roll,math_scores,physics_scores,chemistry_scores
0,37BGI,1,85,75,6
1,UX8DX,2,8,56,49
2,LD14M,3,22,15,43
3,NGL3G,4,91,99,92
4,VZ3ZI,5,64,40,56
5,7SC9C,6,25,84,58
6,1IGQM,7,24,91,26
7,UTUUP,8,97,67,67
8,VRU88,9,7,44,8
9,NIOYE,10,54,13,55


In [67]:
##make the id column the index of the dataframe
df = df.set_index('id')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30 entries, 37BGI to OEOHY
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   roll              30 non-null     int64
 1   math_scores       30 non-null     int64
 2   physics_scores    30 non-null     int64
 3   chemistry_scores  30 non-null     int64
dtypes: int64(4)
memory usage: 1.2+ KB
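
The cell above reassigns `df` because `set_index` returns a new DataFrame rather than modifying the original. A minimal sketch of that behaviour, using a hypothetical three-row frame (`demo`) built from the first few ids shown earlier:

```python
import pandas as pd

# Hypothetical small frame for illustration; values copied from the first rows above.
demo = pd.DataFrame({
    'id': ['37BGI', 'UX8DX', 'LD14M'],
    'roll': [1, 2, 3],
    'math_scores': [85, 8, 22],
})

# set_index returns a new DataFrame; demo itself still has 'id' as a column.
indexed = demo.set_index('id')

# Randomly generated ids could in principle collide, so it can be worth checking
# that the labels are unique before relying on label-based lookups.
ids_are_unique = indexed.index.is_unique

# reset_index moves the index back into an ordinary column.
restored = indexed.reset_index()
```

After `set_index`, `id` no longer appears among the data columns, which is why `df.info()` reports four columns instead of five.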


In [68]:
##access an individual column: bracket and attribute notation return the same Series
df['math_scores'] 
df.math_scores

##which method should we use?

id
37BGI    85
UX8DX     8
LD14M    22
NGL3G    91
VZ3ZI    64
7SC9C    25
1IGQM    24
UTUUP    97
VRU88     7
NIOYE    54
XGZGC    97
O51X8    90
4Q7UB    52
Z5TVI    47
CKDS0    33
D6OHM    80
5PXSQ    12
9VOL3     7
ZGJFF    45
WRKQN     3
E3FK5    91
NIKDD    64
5RQHX    98
RCU65    90
20X3L    11
IPJ66    97
Z3COV    84
7O66I    89
LACYN    63
OEOHY    41
Name: math_scores, dtype: int64
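
On the question of which form to use: both return the same Series here, but attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. The sketch below uses a hypothetical frame (`demo`) whose column names were chosen to show where attribute access breaks down.

```python
import pandas as pd

# Hypothetical frame; the second and third column names are picked to cause trouble.
demo = pd.DataFrame({
    'math_scores': [85, 8, 22],
    'physics scores': [75, 56, 15],   # name contains a space
    'count': [1, 2, 3],               # name collides with DataFrame.count()
})

# For a simple, identifier-like name the two forms agree.
same = demo['math_scores'].equals(demo.math_scores)

# Bracket notation works for any column name.
physics = demo['physics scores']
count_col = demo['count']

# Attribute access resolves to the method, not the column.
count_attr_is_method = callable(demo.count)
```

Bracket notation is therefore the safer default, especially in code that has to work with arbitrary column names.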

In [69]:
##the dataframe's values as a 2D NumPy array (the index is not included)
df.values
##first row of that array
df.values[0]

array([ 1, 85, 75,  6])
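
Note that the `id` index does not appear in the array: only the column values do. A small sketch of this, assuming a hypothetical three-row frame (`demo`); it uses `to_numpy()`, which the pandas documentation recommends over `.values` and which returns the same data here.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration, indexed by id like df above.
demo = pd.DataFrame(
    {'roll': [1, 2, 3], 'math_scores': [85, 8, 22]},
    index=['37BGI', 'UX8DX', 'LD14M'],
)

arr = demo.to_numpy()    # same contents as demo.values
shape = arr.shape        # (number of rows, number of columns)
first_row = arr[0]       # first row as a 1-D array; the index label is not included
math_column = arr[:, 1]  # second column, selected with plain NumPy slicing
```

Once the data is a NumPy array, row and column labels are gone, so all selection is positional.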

## Indexers in Pandas - iloc and loc

In [79]:
##access the row at position 10 (the 11th row, since positions start at 0)
df.iloc[10]

roll                11
math_scores         97
physics_scores      74
chemistry_scores    62
Name: XGZGC, dtype: int64

In [82]:
##access only the math score for the row at position 10
df.iloc[10, 1]

97
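
Because `iloc` is zero-based, position 10 corresponds to roll number 11, as the output above shows. The sketch below illustrates the offset on a hypothetical frame (`demo`) with made-up ids and scores; only the roll numbers mirror the data above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: ids and scores are placeholders, roll numbers run 1..30 as above.
demo = pd.DataFrame(
    {'roll': np.arange(30) + 1, 'math_scores': np.arange(30) * 3},
    index=[f'S{i:02d}' for i in range(30)],
)

row = demo.iloc[10]               # position 10, i.e. the 11th row
roll_at_10 = demo.iloc[10, 0]     # row position 10, column position 0 ('roll')
tenth_row_roll = demo.iloc[9, 0]  # the 10th row sits at position 9
last_roll = demo.iloc[-1, 0]      # negative positions count from the end
```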

In [78]:
##access the first 5 rows of the first 2 columns
df.iloc[:5, :2]

id,roll,math_scores
37BGI,1,85
UX8DX,2,8
LD14M,3,22
NGL3G,4,91
VZ3ZI,5,64


In [76]:
##access values using labels: rows from 'XGZGC' onward,
##columns up to and including 'physics_scores'
df.loc['XGZGC':, :'physics_scores']

id,roll,math_scores,physics_scores
XGZGC,11,97,74
O51X8,12,90,98
4Q7UB,13,52,19
Z5TVI,14,47,33
CKDS0,15,33,1
D6OHM,16,80,81
5PXSQ,17,12,94
9VOL3,18,7,10
ZGJFF,19,45,19
WRKQN,20,3,34
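
One difference between the two indexers is easy to miss: a `loc` slice includes its end label, whereas an `iloc` slice excludes its stop position, like ordinary Python slicing. A minimal sketch on a hypothetical four-row frame (`demo`) built from the first rows of the data above:

```python
import pandas as pd

# Hypothetical frame for illustration; values copied from the first four rows above.
demo = pd.DataFrame(
    {
        'roll': [1, 2, 3, 4],
        'math_scores': [85, 8, 22, 91],
        'physics_scores': [75, 56, 15, 99],
        'chemistry_scores': [6, 49, 43, 92],
    },
    index=pd.Index(['37BGI', 'UX8DX', 'LD14M', 'NGL3G'], name='id'),
)

# Label-based: both 'LD14M' and 'physics_scores' are included in the result.
by_label = demo.loc['UX8DX':'LD14M', 'roll':'physics_scores']

# Position-based: stop positions 3 and 3 are excluded, giving the same selection.
by_position = demo.iloc[1:3, 0:3]
```

Both selections contain two rows and three columns, so translating between `loc` and `iloc` means adjusting the end of the slice by one.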
