## 2.5 MAKING CHANGES TO SERIES AND DATAFRAMES

Now that we know various ways of subsetting and slicing our data (see Table 2.3), we should be able to alter our data objects.

2.5.1 Add Additional Columns

The dtype of the Born and Died columns in the scientists data set is object, meaning the values are stored as strings.

In [1]:
import pandas as pd

In [2]:
scientists = pd.read_csv('data/scientists.csv')

In [5]:
print(scientists)

                   Name        Born        Died  Age          Occupation
0     Rosaline Franklin  1920-07-25  1958-04-16   37             Chemist
1        William Gosset  1876-06-13  1937-10-16   61        Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90               Nurse
3           Marie Curie  1867-11-07  1934-07-04   66             Chemist
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist
5             John Snow  1813-03-15  1858-06-16   45           Physician
6           Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist
7          Johann Gauss  1777-04-30  1855-02-23   77       Mathematician


In [3]:
ages = scientists['Age']

print(ages)

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64


In [4]:
print(scientists['Born'].dtype)

object


In [6]:
print(scientists['Died'].dtype)

object


We can convert the strings to a proper datetime type so we can perform common date and time operations (e.g., take differences between dates or calculate a person’s age). If your dates are stored in a specific format, you can pass your own format string. A list of format codes can be found in the Python datetime module documentation [7]. More examples with datetimes can be found in Chapter 11. Our dates look like “YYYY-MM-DD,” so we can use the '%Y-%m-%d' format.

7. datetime module documentation: https://docs.python.org/3.5/library/datetime.html#strftime-and-strptime-behavior

In [7]:
# format the 'Born' column as a datetime

born_datetime = pd.to_datetime(scientists['Born'], format='%Y-%m-%d')

print(born_datetime)

0   1920-07-25
1   1876-06-13
2   1820-05-12
3   1867-11-07
4   1907-05-27
5   1813-03-15
6   1912-06-23
7   1777-04-30
Name: Born, dtype: datetime64[ns]


In [8]:
# format the 'Died' column as a datetime

died_datetime = pd.to_datetime(scientists['Died'], format='%Y-%m-%d')

print(died_datetime)

0   1958-04-16
1   1937-10-16
2   1910-08-13
3   1934-07-04
4   1964-04-14
5   1858-06-16
6   1954-06-07
7   1855-02-23
Name: Died, dtype: datetime64[ns]
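As a rough sketch of supplying your own format, suppose the dates had been written day-first with slashes instead. The values and variable names below are made up for illustration and are not part of the scientists data set.

```python
import pandas as pd

# hypothetical day/month/year strings, not taken from scientists.csv
raw_dates = pd.Series(['25/07/1920', '13/06/1876', '12/05/1820'])

# %d = day of month, %m = month number, %Y = four-digit year
parsed = pd.to_datetime(raw_dates, format='%d/%m/%Y')

# the strings are now datetimes, so date parts can be extracted
print(parsed.dt.year.tolist())  # [1920, 1876, 1820]
```

Only the format string changes from the cells above; the call to pd.to_datetime is otherwise the same.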


If we wanted, we could create a new set of columns that contain the datetime representations of the object (string) dates. The example below uses Python’s multiple assignment syntax (Appendix Q).

In [9]:
scientists['born_dt'], scientists['died_dt'] = (born_datetime,
                                                died_datetime)

print(scientists.head())

                   Name        Born        Died  Age    Occupation    born_dt  \
0     Rosaline Franklin  1920-07-25  1958-04-16   37       Chemist 1920-07-25   
1        William Gosset  1876-06-13  1937-10-16   61  Statistician 1876-06-13   
2  Florence Nightingale  1820-05-12  1910-08-13   90         Nurse 1820-05-12   
3           Marie Curie  1867-11-07  1934-07-04   66       Chemist 1867-11-07   
4         Rachel Carson  1907-05-27  1964-04-14   56     Biologist 1907-05-27   

     died_dt  
0 1958-04-16  
1 1937-10-16  
2 1910-08-13  
3 1934-07-04  
4 1964-04-14  


In [10]:
print(scientists.shape)

(8, 7)
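The multiple assignment used above can be seen in isolation with a small stand-in dataframe. This is only a sketch; the two-row frame and the name df are assumptions made for illustration.

```python
import pandas as pd

# small stand-in frame with two string date columns (illustrative data)
df = pd.DataFrame({'Born': ['1920-07-25', '1876-06-13'],
                   'Died': ['1958-04-16', '1937-10-16']})

# the right-hand side is a tuple of two Series; each one is unpacked
# into the column named in the same position on the left-hand side
df['born_dt'], df['died_dt'] = (pd.to_datetime(df['Born']),
                                pd.to_datetime(df['Died']))

print(df.shape)  # (2, 4): two rows, two original and two new columns
```

Writing two separate assignment statements, one per column, would give the same result.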


2.5.2 Directly Change a Column

We can also assign new values directly to an existing column. The example in this section shows how to randomize the contents of a column. More complex calculations that involve multiple columns can be seen in Chapter 9, in the discussion of the apply method.

First, let’s look at the original Age values.

In [11]:
print(scientists['Age'])

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64


Now let’s shuffle the values.

In [12]:
import random

# set a seed so the randomness is always the same

random.seed(42)

random.shuffle(scientists['Age'])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  x[i], x[j] = x[j], x[i]


In [13]:
print(scientists['Age'])

0    66
1    56
2    41
3    77
4    90
5    45
6    37
7    61
Name: Age, dtype: int64


The SettingWithCopyWarning message [8] from the previous code tells us that the proper way to handle the assignment would be to write it using loc. Alternatively, we can use the built-in sample method to randomly draw as many values as the column contains, which amounts to shuffling it.

8. Indexing view versus copy: https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In this example, you need to call reset_index because sample returns the values with their original row index labels still attached. If you assign the result back to the column as is, the “scrambled” values will automatically align on the index and end up back in their pre-sample order. The drop=True parameter in reset_index tells Pandas not to insert the old index into the dataframe columns, so that only the values are kept.

In [14]:
# random_state fixes the seed so the shuffle is reproducible

scientists['Age'] = scientists['Age'].\
    sample(len(scientists['Age']), random_state=24).\
    reset_index(drop=True) # values stay randomized

# we shuffled this column twice

print(scientists['Age'])

0    61
1    45
2    37
3    90
4    56
5    66
6    77
7    41
Name: Age, dtype: int64
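The index alignment described above is easy to miss, so here is a minimal sketch on a four-row stand-in column. The frame and variable names are illustrative, and frac=1 is used here as an assumed shorthand for sampling every row.

```python
import pandas as pd

ages = pd.DataFrame({'Age': [37, 61, 90, 66]})
original = ages['Age'].tolist()

# sample keeps each value's original index label
shuffled = ages['Age'].sample(frac=1, random_state=24)

# assigning without reset_index aligns on those labels,
# so the column ends up back in its original order
aligned = ages.copy()
aligned['Age'] = shuffled
print(aligned['Age'].tolist() == original)  # True

# resetting the index discards the old labels,
# so the shuffled order survives the assignment
kept = ages.copy()
kept['Age'] = shuffled.reset_index(drop=True)
print(kept['Age'].tolist() == shuffled.tolist())  # True
```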


In [15]:
print(scientists)

                   Name        Born        Died  Age          Occupation  \
0     Rosaline Franklin  1920-07-25  1958-04-16   61             Chemist   
1        William Gosset  1876-06-13  1937-10-16   45        Statistician   
2  Florence Nightingale  1820-05-12  1910-08-13   37               Nurse   
3           Marie Curie  1867-11-07  1934-07-04   90             Chemist   
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist   
5             John Snow  1813-03-15  1858-06-16   66           Physician   
6           Alan Turing  1912-06-23  1954-06-07   77  Computer Scientist   
7          Johann Gauss  1777-04-30  1855-02-23   41       Mathematician   

     born_dt    died_dt  
0 1920-07-25 1958-04-16  
1 1876-06-13 1937-10-16  
2 1820-05-12 1910-08-13  
3 1867-11-07 1934-07-04  
4 1907-05-27 1964-04-14  
5 1813-03-15 1858-06-16  
6 1912-06-23 1954-06-07  
7 1777-04-30 1855-02-23  


Notice that the random.shuffle function seems to work directly on the column. The documentation for random.shuffle [9] mentions that the sequence will be shuffled “in place,” meaning that it will work directly on the sequence. Contrast this with the approach in Section 2.5.1, in which we assigned the newly calculated values to a separate variable before we could assign them to the column.

9. Random shuffle: https://docs.python.org/3.6/library/random.html#random.shuffle

We can recalculate the “real” age using datetime arithmetic. More information about datetime can be found in Chapter 11.

In [16]:
# subtracting dates gives the number of days

scientists['age_days_dt'] = (scientists['died_dt'] -
                             scientists['born_dt'])

print(scientists)

                   Name        Born        Died  Age          Occupation  \
0     Rosaline Franklin  1920-07-25  1958-04-16   61             Chemist   
1        William Gosset  1876-06-13  1937-10-16   45        Statistician   
2  Florence Nightingale  1820-05-12  1910-08-13   37               Nurse   
3           Marie Curie  1867-11-07  1934-07-04   90             Chemist   
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist   
5             John Snow  1813-03-15  1858-06-16   66           Physician   
6           Alan Turing  1912-06-23  1954-06-07   77  Computer Scientist   
7          Johann Gauss  1777-04-30  1855-02-23   41       Mathematician   

     born_dt    died_dt age_days_dt  
0 1920-07-25 1958-04-16  13779 days  
1 1876-06-13 1937-10-16  22404 days  
2 1820-05-12 1910-08-13  32964 days  
3 1867-11-07 1934-07-04  24345 days  
4 1907-05-27 1964-04-14  20777 days  
5 1813-03-15 1858-06-16  16529 days  
6 1912-06-23 1954-06-07  15324 days  
7 1777-04-30 1855-02-23  28422 days  

In [17]:
# we can convert the value to just the year
# using the astype method

scientists['age_years_dt'] = scientists['age_days_dt'].\
    astype('timedelta64[Y]')

print(scientists)

                   Name        Born        Died  Age          Occupation  \
0     Rosaline Franklin  1920-07-25  1958-04-16   61             Chemist   
1        William Gosset  1876-06-13  1937-10-16   45        Statistician   
2  Florence Nightingale  1820-05-12  1910-08-13   37               Nurse   
3           Marie Curie  1867-11-07  1934-07-04   90             Chemist   
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist   
5             John Snow  1813-03-15  1858-06-16   66           Physician   
6           Alan Turing  1912-06-23  1954-06-07   77  Computer Scientist   
7          Johann Gauss  1777-04-30  1855-02-23   41       Mathematician   

     born_dt    died_dt age_days_dt  age_years_dt  
0 1920-07-25 1958-04-16  13779 days          37.0  
1 1876-06-13 1937-10-16  22404 days          61.0  
2 1820-05-12 1910-08-13  32964 days          90.0  
3 1867-11-07 1934-07-04  24345 days          66.0  
4 1907-05-27 1964-04-14  20777 days          56.0  
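5 1813-03-15 1858-06-16  16529 days          45.0  
6 1912-06-23 1954-06-07  15324 days          41.0  
7 1777-04-30 1855-02-23  28422 days          77.0  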
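The astype('timedelta64[Y]') conversion relies on the behavior of older pandas versions; as far as we are aware, pandas 2.0 and later no longer accept year units in that call. One possible alternative, sketched below on a two-row stand-in frame, is to take the whole number of days with the .dt accessor and divide by an average year length. The 365.2425 figure (the mean Gregorian year) is an assumption, so the result is an approximation.

```python
import pandas as pd

# stand-in frame with the first two scientists' dates
dates = pd.DataFrame({
    'born_dt': pd.to_datetime(['1920-07-25', '1876-06-13']),
    'died_dt': pd.to_datetime(['1958-04-16', '1937-10-16']),
})

# subtracting datetimes gives timedeltas; .dt.days extracts whole days
age_days = (dates['died_dt'] - dates['born_dt']).dt.days

# approximate completed years, assuming 365.2425 days per year
age_years = (age_days // 365.2425).astype(int)

print(age_days.tolist())   # [13779, 22404]
print(age_years.tolist())  # [37, 61]
```

Because it divides by an average year length, this can be off by one for dates that fall very close to a birthday.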

Many functions and methods in pandas have an inplace parameter that you can set to True if you want to perform the action “in place.” Doing so changes the object directly, and the call returns nothing (None).

Note

We could have assigned the converted datetime directly to the column, but the point is that an assignment still needed to be performed. The random.shuffle example works “in place,” so nothing is explicitly returned from the function; the value passed into it is manipulated directly.
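The difference between working “in place” and returning a new object can be sketched as follows. The list, the one-column frame, and the use of sort_values are illustrative choices and not part of the scientists example.

```python
import random

import pandas as pd

values = [37, 61, 90, 66]
random.seed(42)

# random.shuffle reorders the list itself and returns None
result = random.shuffle(values)
print(result)  # None

df = pd.DataFrame({'Age': [61, 37, 90]})

# by default, sort_values returns a new dataframe and leaves df untouched
sorted_copy = df.sort_values('Age')
print(df['Age'].tolist())  # [61, 37, 90]

# with inplace=True, df itself is reordered and None is returned
returned = df.sort_values('Age', inplace=True)
print(returned, df['Age'].tolist())  # None [37, 61, 90]
```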

2.5.3 Dropping Values

To drop a column, we can either select all the columns we want to keep by using the column subsetting techniques, or name the columns to drop with the drop method on our dataframe [10].

10. Drop method: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

In [18]:
# all the current columns in our data

print(scientists.columns)

Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'born_dt', 'died_dt',
       'age_days_dt', 'age_years_dt'],
      dtype='object')


In [19]:
# drop the shuffled age column
# you provide the axis=1 argument to drop column-wise

scientists_dropped = scientists.drop(['Age'], axis=1)

# columns after dropping our column

print(scientists_dropped.columns)

Index(['Name', 'Born', 'Died', 'Occupation', 'born_dt', 'died_dt',
       'age_days_dt', 'age_years_dt'],
      dtype='object')
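Both options mentioned at the start of this section can be compared on a one-row stand-in frame. This is a sketch with illustrative data; the columns= keyword of drop is used here as an alternative spelling of axis=1.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Marie Curie'], 'Age': [66],
                   'Occupation': ['Chemist']})

# option 1: subset only the columns we want to keep
kept = df[['Name', 'Occupation']]

# option 2: name the columns to drop; columns= is equivalent to axis=1
dropped = df.drop(columns=['Age'])

print(list(kept.columns) == list(dropped.columns))  # True
```

In both cases df itself still contains the Age column, since neither operation was performed in place.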
