# Subsetting

Subsetting is an indexing feature that lets us **select or exclude rows and columns (variables / features)** of a data frame. A data frame can be subset (sliced) in several ways.

In [2]:
import numpy as np
import pandas as pd

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
             'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
             'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
             'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(exam_data, index=labels)

In [3]:
df

Unnamed: 0,name,score,attempts,qualify
a,Anastasia,12.5,1,yes
b,Dima,9.0,3,no
c,Katherine,16.5,2,yes
d,James,,3,no
e,Emily,9.0,2,no
f,Michael,20.0,3,yes
g,Matthew,14.5,1,yes
h,Laura,,1,no
i,Kevin,8.0,2,no
j,Jonas,19.0,1,yes


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
name        10 non-null object
score       8 non-null float64
attempts    10 non-null int64
qualify     10 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes


**a. Subsetting by specifying the number of rows**

In [5]:
df[:3]                                        # first 3 rows of the dataframe

Unnamed: 0,name,score,attempts,qualify
a,Anastasia,12.5,1,yes
b,Dima,9.0,3,no
c,Katherine,16.5,2,yes
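
The slice above selects rows by position. As a minimal sketch (the variable names `first_three`, `same_by_head`, `same_by_iloc` and `last_two` are illustrative, not part of the original notebook), the same selection can also be written with `head` or `iloc`:

```python
import numpy as np
import pandas as pd

# Rebuild the exam data frame from the top of this section so the snippet runs on its own.
exam_data = {
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
             'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes'],
}
df = pd.DataFrame(exam_data, index=list('abcdefghij'))

first_three = df[:3]         # plain slice: rows at positions 0, 1, 2
same_by_head = df.head(3)    # head(n) returns the first n rows
same_by_iloc = df.iloc[:3]   # explicit positional indexing
last_two = df.iloc[-2:]      # negative positions count from the end

print(first_three.equals(same_by_head))
print(list(last_two.index))
```

`iloc` is usually the clearer choice when the intent is "by position", since it cannot be confused with label-based selection.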


**b. Subsetting by column names**

In [6]:
df[['name','score']]

Unnamed: 0,name,score
a,Anastasia,12.5
b,Dima,9.0
c,Katherine,16.5
d,James,
e,Emily,9.0
f,Michael,20.0
g,Matthew,14.5
h,Laura,
i,Kevin,8.0
j,Jonas,19.0
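
One detail worth illustrating is the difference between single and double brackets. The sketch below uses a shortened, hypothetical version of the exam data (three rows only) and illustrative variable names:

```python
import pandas as pd

# Shortened, hypothetical version of the exam data, just to keep the sketch small.
df = pd.DataFrame(
    {'name': ['Anastasia', 'Dima', 'Katherine'],
     'score': [12.5, 9.0, 16.5],
     'attempts': [1, 3, 2]},
    index=['a', 'b', 'c'],
)

names = df['name']                 # single brackets with one name: a Series
names_frame = df[['name']]         # list with one name: a one-column DataFrame
two_cols = df[['name', 'score']]   # list of names: a DataFrame with those columns

print(type(names).__name__, type(names_frame).__name__)
print(two_cols.shape)
```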


**c. Subsetting specific rows (positions 1, 3, 5 and 6) of specific columns**

In [8]:
df.loc[df.index[[1, 3, 5, 6]], ['name', 'score']]   # rows by position, columns by label (.ix is deprecated and was removed in pandas 1.0)

Unnamed: 0,name,score
b,Dima,9.0
d,James,
f,Michael,20.0
g,Matthew,14.5
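
Since `.ix` is deprecated in favour of `.loc` (labels) and `.iloc` (positions), the same selection can be sketched with either indexer. The variable names below are illustrative only:

```python
import numpy as np
import pandas as pd

# Rebuild the exam data frame from the top of this section so the snippet runs on its own.
exam_data = {
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
             'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes'],
}
df = pd.DataFrame(exam_data, index=list('abcdefghij'))

rows = [1, 3, 5, 6]

# Purely positional: row positions and column positions.
by_position = df.iloc[rows, [0, 1]]

# Purely label based: row labels and column names.
by_label = df.loc[['b', 'd', 'f', 'g'], ['name', 'score']]

# Mixed: translate row positions to labels first, then use .loc.
mixed = df.loc[df.index[rows], ['name', 'score']]

print(by_position.equals(by_label), by_label.equals(mixed))
```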


**d. Subsetting based on logical conditions**

In [9]:
df[df['score'].between(15,20)]                    # selecting the rows with 'score' values between 15 and 20 (both inclusive)

Unnamed: 0,name,score,attempts,qualify
c,Katherine,16.5,2,yes
f,Michael,20.0,3,yes
j,Jonas,19.0,1,yes


In [10]:
df[(df['score']>15) & (df['attempts']<2)]       # selecting the rows with 'attempts' < 2 and 'score' > 15

Unnamed: 0,name,score,attempts,qualify
j,Jonas,19.0,1,yes
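
Conditions like these are boolean Series, so they can be stored and combined with `&` (and), `|` (or) and `~` (not). The following is a minimal sketch with illustrative variable names; note that a comparison against a missing score evaluates to False, so rows with NaN scores never match `> 15`:

```python
import numpy as np
import pandas as pd

# Rebuild the exam data frame from the top of this section so the snippet runs on its own.
exam_data = {
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
             'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes'],
}
df = pd.DataFrame(exam_data, index=list('abcdefghij'))

high_score = df['score'] > 15      # boolean Series; NaN scores give False
first_try = df['attempts'] < 2

both = df[high_score & first_try]               # both conditions hold
either = df[high_score | first_try]             # at least one condition holds
not_qualified = df[~(df['qualify'] == 'yes')]   # negation of a condition
missing_score = df[df['score'].isna()]          # rows where the score is NaN

print(list(both.index))
print(list(either.index))
```

Each condition needs its own parentheses when written inline, because `&` and `|` bind more tightly than comparison operators in Python.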


## Adding and Dropping Rows and Columns

**a. Adding a new row to the data frame**

In [11]:
df.loc['k'] = ["Suresh", 15.5, 1, 'yes']      # values must follow the column order: name, score, attempts, qualify
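
A list assigned through `.loc` is matched to the columns by position, so the values have to follow the column order (name, score, attempts, qualify). One way to avoid depending on that order, sketched here under the assumption that a copy of the data frame is acceptable, is to build a one-row frame with named columns and append it with `pd.concat`. The names `new_row` and `df_with_k` are illustrative:

```python
import numpy as np
import pandas as pd

# Rebuild the exam data frame from the top of this section so the snippet runs on its own.
exam_data = {
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
             'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes'],
}
df = pd.DataFrame(exam_data, index=list('abcdefghij'))

# One-row frame: every value is tied to its column by name, not by position.
new_row = pd.DataFrame(
    {'qualify': ['yes'], 'attempts': [1], 'name': ['Suresh'], 'score': [15.5]},
    index=['k'],
)

# concat aligns on column names and returns a new frame; df itself is unchanged.
df_with_k = pd.concat([df, new_row])

print(df_with_k.loc['k'])
```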

**b. Dropping the newly added row in the data frame**

In [13]:
df.drop('k')                                  # returns a new data frame without row 'k'; df itself is unchanged

Unnamed: 0,name,score,attempts,qualify
a,Anastasia,12.5,1,yes
b,Dima,9.0,3,no
c,Katherine,16.5,2,yes
d,James,,3,no
e,Emily,9.0,2,no
f,Michael,20.0,3,yes
g,Matthew,14.5,1,yes
h,Laura,,1,no
i,Kevin,8.0,2,no
j,Jonas,19.0,1,yes
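
Note that `drop` returns a new data frame rather than modifying the original. The sketch below uses a shortened, hypothetical version of the exam data and illustrative variable names to show the difference between previewing a drop and keeping it:

```python
import pandas as pd

# Shortened, hypothetical version of the exam data, just to keep the sketch small.
df = pd.DataFrame(
    {'name': ['Anastasia', 'Dima', 'Katherine'],
     'score': [12.5, 9.0, 16.5],
     'attempts': [1, 3, 2],
     'qualify': ['yes', 'no', 'yes']},
    index=['a', 'b', 'c'],
)

df.loc['k'] = ['Suresh', 15.5, 1, 'yes']   # add a row labelled 'k'

preview = df.drop('k')            # new data frame without 'k'
still_there = 'k' in df.index     # True: df itself was not modified

df = df.drop('k')                 # reassign to make the drop permanent
gone_now = 'k' not in df.index    # True after the reassignment

print(still_there, gone_now)
```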


**c. Dropping the columns from the data frame**

In [14]:
df.drop('attempts', axis=1)                   # axis=1 drops a column; passing the axis positionally no longer works in pandas 2.0+

Unnamed: 0,name,score,qualify
a,Anastasia,12.5,yes
b,Dima,9,no
c,Katherine,16.5,yes
d,James,,no
e,Emily,9,no
f,Michael,20,yes
g,Matthew,14.5,yes
h,Laura,,no
i,Kevin,8,no
j,Jonas,19,yes
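
Columns can also be dropped with the `columns` keyword, which avoids having to remember the axis number. A minimal sketch, again on a shortened, hypothetical version of the exam data with illustrative variable names:

```python
import pandas as pd

# Shortened, hypothetical version of the exam data, just to keep the sketch small.
df = pd.DataFrame(
    {'name': ['Anastasia', 'Dima', 'Katherine'],
     'score': [12.5, 9.0, 16.5],
     'attempts': [1, 3, 2],
     'qualify': ['yes', 'no', 'yes']},
    index=['a', 'b', 'c'],
)

no_attempts = df.drop('attempts', axis=1)             # axis=1 means "columns"
same_result = df.drop(columns='attempts')             # equivalent keyword form
two_gone = df.drop(columns=['attempts', 'qualify'])   # several columns at once

print(list(no_attempts.columns))
print(list(two_gone.columns))
print(list(df.columns))   # the original still has all four columns
```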


**d. Adding a new column to the data frame**

In [15]:
color = ['Red','Blue','Orange','Red','White','White','Blue','Green','Green','Red']
df['color'] = color
df

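ValueError: Length of values does not match length of index

The assignment fails because df.drop('k') in step b returned a new data frame and left df unchanged, so df still has 11 rows while color has only 10 values. Assigning the result back (df = df.drop('k')) before adding the column resolves the error.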
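
A new column assigned from a list needs exactly one value per row. The following sketch, with `error_message` as an illustrative name, adds the colour column to a 10-row data frame and shows that a list of the wrong length is rejected:

```python
import numpy as np
import pandas as pd

# Rebuild the 10-row exam data frame so the snippet runs on its own.
exam_data = {
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
             'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes'],
}
df = pd.DataFrame(exam_data, index=list('abcdefghij'))

color = ['Red', 'Blue', 'Orange', 'Red', 'White',
         'White', 'Blue', 'Green', 'Green', 'Red']

# 10 values for 10 rows: the column is added in place.
df['color'] = color

# 9 values for 10 rows: pandas raises ValueError and adds nothing.
error_message = None
try:
    df['bad'] = color[:9]
except ValueError as exc:
    error_message = str(exc)

print(list(df.columns))
print(error_message)
```

Checking `len(color) == len(df)` before the assignment is a cheap way to catch this kind of mismatch early.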