# Selecting and Editing Data

### Dropping Entries

In [1]:
import numpy as np
import pandas as pd

In [3]:
# create sample series
ser1 = pd.Series(np.arange(3), index = ['a', 'b', 'c'])
ser1

a    0
b    1
c    2
dtype: int32

We can use the drop method to remove items from a Series by index label. Note, however, that drop does not modify the original: it returns a new Series. If you display the original Series again, the dropped value will still be there.

In [6]:
# drop index b
ser1.drop('b')

a    0
c    2
dtype: int32

In [8]:
# see that the original series remains unchanged
ser1

a    0
b    1
c    2
dtype: int32
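To keep the result of a drop, one option is to assign the returned object to a name. The sketch below rebuilds ser1 so that it runs on its own; the name ser1_dropped is only illustrative and does not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# same sample series as above
ser1 = pd.Series(np.arange(3), index=['a', 'b', 'c'])

# drop returns a new Series, so assign it to keep the result
ser1_dropped = ser1.drop('b')

# the original keeps all three labels, the new object has two
print(list(ser1.index))
print(list(ser1_dropped.index))
```

Assigning back to the same name (ser1 = ser1.drop('b')) would have the same effect if the original is no longer needed.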

Let's see how this works with DataFrames. 

In [11]:
# create sample DataFrame
df1 = pd.DataFrame(np.arange(9).reshape(3, 3), 
                   index = ['SF', 'LA', 'NY'],
                  columns = ['pop', 'size', 'year'])
df1

    pop  size  year
SF    0     1     2
LA    3     4     5
NY    6     7     8


In [12]:
# drop a row from df1
df1.drop('LA')

    pop  size  year
SF    0     1     2
NY    6     7     8


When dropping from a DataFrame, the default is axis = 0, which refers to rows. To drop a column, we need to give the column name and specify axis = 1.

In [15]:
# drop column year from df1
df1.drop('year', axis = 1)

    pop  size
SF    0     1
LA    3     4
NY    6     7


If you tried using just df1.drop('year'), you would get a KeyError, since there is no row label called 'year'.
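As a small sketch of both points, the example below drops the column through the columns keyword (an alternative spelling of axis = 1 that drop also accepts) and catches the error raised when the axis is left out. The names no_year and raised are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   index=['SF', 'LA', 'NY'],
                   columns=['pop', 'size', 'year'])

# columns= is equivalent to passing the label together with axis=1
no_year = df1.drop(columns='year')
print(list(no_year.columns))

# without an axis, pandas looks for a row labelled 'year' and fails
try:
    df1.drop('year')
    raised = False
except KeyError:
    raised = True
print(raised)
```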

### Selecting Entries

In [18]:
# create a series
ser2 = 2 * ser1
ser2

a    0
b    2
c    4
dtype: int32

In [20]:
# select a value by the index
ser2['b']

2

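Entries in a pandas Series can be referenced by index label or by integer position. In ser2, label 'a' is at position 0, 'b' is at position 1, and 'c' is at position 2. Note that recent pandas versions deprecate positional lookups through plain brackets when the index holds labels; ser2.iloc[1] is the explicit positional form.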

In [22]:
# select a value by index number
ser2[1]

2

In [24]:
# this allows us to use numerical ranges for our indices
ser2[0:2]

a    0
b    2
dtype: int32
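Because plain brackets can mean either a label or a position, it can help to make the intent explicit with the .loc (label) and .iloc (position) indexers. The following is a minimal sketch that rebuilds ser2 from literal values; the variable names are illustrative.

```python
import pandas as pd

ser2 = pd.Series([0, 2, 4], index=['a', 'b', 'c'])

by_label = ser2.loc['b']       # label-based lookup
by_position = ser2.iloc[1]     # position-based lookup

# positional slices exclude the end point, label slices include it
pos_slice = ser2.iloc[0:2]         # positions 0 and 1
label_slice = ser2.loc['a':'b']    # labels 'a' through 'b', inclusive

print(by_label, by_position)
print(list(pos_slice.index), list(label_slice.index))
```

Both slices return the same two rows here, but for different reasons: the positional slice stops before position 2, while the label slice includes its end label.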

In [25]:
# we can do something similar by passing a list of index values
ser2[['a', 'b']]

a    0
b    2
dtype: int32

In [26]:
# we can also grab from our series using some logic
# this gives you all series values greater than 3
ser2[ser2 > 3]

c    4
dtype: int32

In [29]:
# this is useful for assignment
# replace all values > 3 with 10
ser2[ser2 > 3] = 10
ser2

a     0
b     2
c    10
dtype: int32
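The assignment above changes ser2 itself. If the original values should be preserved, one alternative is the Series.mask method, which returns a new Series with the matching entries replaced. This is a sketch with ser2 rebuilt from literal values; the name replaced is illustrative.

```python
import pandas as pd

ser2 = pd.Series([0, 2, 4], index=['a', 'b', 'c'])

# mask replaces values where the condition is True and returns a new Series
replaced = ser2.mask(ser2 > 3, 10)

print(replaced.tolist())
print(ser2.tolist())   # the original is left untouched
```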

In [30]:
# create a DataFrame
df2 = pd.DataFrame(np.arange(25).reshape(5, 5),
                  index = ['NYC', 'LA', 'SF', 'DC', 'Chi'],
                  columns = ['a', 'b', 'c', 'd', 'e'])
df2

      a   b   c   d   e
NYC   0   1   2   3   4
LA    5   6   7   8   9
SF   10  11  12  13  14
DC   15  16  17  18  19
Chi  20  21  22  23  24


In [31]:
# select data by column name
df2['b']

NYC     1
LA      6
SF     11
DC     16
Chi    21
Name: b, dtype: int32

In [32]:
# you can also use multiple column names
df2[['b', 'e']]

      b   e
NYC   1   4
LA    6   9
SF   11  14
DC   16  19
Chi  21  24


In [34]:
# we can also use logic here
# return every row where column c > 8
df2[df2['c'] > 8]

      a   b   c   d   e
SF   10  11  12  13  14
DC   15  16  17  18  19
Chi  20  21  22  23  24


In [35]:
# we can also show a boolean DataFrame by using a logical comparison
df2 > 10

         a      b      c      d      e
NYC  False  False  False  False  False
LA   False  False  False  False  False
SF   False   True   True   True   True
DC    True   True   True   True   True
Chi   True   True   True   True   True
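Boolean conditions can also be combined, and a boolean DataFrame can be used as a mask. The sketch below assumes the same df2 as above; the names both, either, and masked are illustrative. Each condition needs its own parentheses because & and | bind more tightly than the comparison operators.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.arange(25).reshape(5, 5),
                   index=['NYC', 'LA', 'SF', 'DC', 'Chi'],
                   columns=['a', 'b', 'c', 'd', 'e'])

# rows where column c > 8 AND column e < 20
both = df2[(df2['c'] > 8) & (df2['e'] < 20)]

# rows where column a equals 0 OR column a equals 20
either = df2[(df2['a'] == 0) | (df2['a'] == 20)]

# keep cells where the condition holds; the rest become NaN
masked = df2.where(df2 > 10)

print(list(both.index))
print(list(either.index))
print(int(masked.isna().sum().sum()))
```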


In [36]:
# we can also select a row by its label with the .loc indexer
# (.ix was removed in pandas 1.0)
df2.loc['LA']

a    5
b    6
c    7
d    8
e    9
Name: LA, dtype: int32

In [38]:
# we can do the same thing by integer position with the .iloc indexer
# (.ix was removed in pandas 1.0)
df2.iloc[1]

a    5
b    6
c    7
d    8
e    9
Name: LA, dtype: int32
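Rows and columns can also be selected in one step with the label-based .loc and position-based .iloc indexers. The following is a minimal sketch on the same df2; the names cell, block, and subset are illustrative.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.arange(25).reshape(5, 5),
                   index=['NYC', 'LA', 'SF', 'DC', 'Chi'],
                   columns=['a', 'b', 'c', 'd', 'e'])

# a single cell by row label and column label
cell = df2.loc['LA', 'c']

# the first two rows and first two columns by position
block = df2.iloc[0:2, 0:2]

# specific rows and columns by label
subset = df2.loc[['NYC', 'SF'], ['a', 'e']]

print(cell)
print(block)
print(subset)
```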

### Data Alignment

##### Series Alignment

In [40]:
# re-use ser1
ser1

a    0
b    1
c    2
dtype: int32

In [43]:
# create a new series
ser3 = pd.Series([3, 4, 5, 6],
                index = ['a', 'b', 'c', 'd'])
ser3

a    3
b    4
c    5
d    6
dtype: int64

In [44]:
# add the series together
ser1 + ser3

a     3
b     5
c     7
d   NaN
dtype: float64

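What happened? Pandas added the values where the index labels matched. Since ser1 has no 'd' label, there is nothing to add to ser3's 'd' value, so the result for 'd' is a null value (NaN). NaN plus anything is still NaN, which is also why the result dtype changes to float64.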
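A short sketch of this behavior, with both series rebuilt so the block runs on its own: the result carries the union of the two indexes, and isna() shows which labels had no partner. The name result is illustrative.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series(np.arange(3), index=['a', 'b', 'c'])
ser3 = pd.Series([3, 4, 5, 6], index=['a', 'b', 'c', 'd'])

result = ser1 + ser3

# the result index is the union of both indexes
print(list(result.index))

# True marks labels that were missing from one of the inputs
print(result.isna().tolist())

# NaN propagates through arithmetic
print(np.isnan(np.nan + 1))
```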

##### DataFrame Alignment

In [48]:
# create a DataFrame
df3 = pd.DataFrame(np.arange(9).reshape(3, 3), 
                  columns = list('abc'),
                  index = ['NYC', 'LA', 'SF'])
df3

     a  b  c
NYC  0  1  2
LA   3  4  5
SF   6  7  8


In [50]:
# create another DataFrame
df4 = pd.DataFrame(np.arange(16).reshape(4, 4),
                  columns = list('acde'),
                  index = ['NYC', 'DC', 'SF', 'LA'])
df4

      a   c   d   e
NYC   0   1   2   3
DC    4   5   6   7
SF    8   9  10  11
LA   12  13  14  15


In [54]:
# add the two DataFrames
df3 + df4

        a   b     c   d   e
DC    NaN NaN   NaN NaN NaN
LA   15.0 NaN  18.0 NaN NaN
NYC   0.0 NaN   3.0 NaN NaN
SF   14.0 NaN  17.0 NaN NaN


Values only appear where both the row and the column labels match; everywhere else you get NaN. However, if we call the .add() method instead of using the plus sign, we can pass an additional argument: fill_value. This sets a default value for any cell that exists in only one of the two DataFrames. The only place we still get a null value is where the cell is missing from both: column 'b' is missing from df4, and row 'DC' is missing from df3.

In [55]:
# add the DataFrames using the .add() method
df3.add(df4, fill_value = 0)

      a    b   c   d   e
DC    4  NaN   5   6   7
LA   15  4.0  18  14  15
NYC   0  1.0   3   2   3
SF   14  7.0  17  10  11
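To confirm which cell is still missing, and to fill it if a complete table is needed, something like the following could be used. It is a sketch that rebuilds df3 and df4; chaining fillna(0) is one possible choice, not the only one, and the names summed and filled are illustrative.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=list('abc'),
                   index=['NYC', 'LA', 'SF'])
df4 = pd.DataFrame(np.arange(16).reshape(4, 4),
                   columns=list('acde'),
                   index=['NYC', 'DC', 'SF', 'LA'])

summed = df3.add(df4, fill_value=0)

# only the cell missing from both inputs is still NaN
print(int(summed.isna().sum().sum()))
print(bool(np.isnan(summed.loc['DC', 'b'])))

# fill whatever is left after the addition
filled = summed.fillna(0)
print(filled.loc['DC', 'b'])
```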
