# Creating and Modifying Columns

In this set of notes, we'll learn how to operate on and create new columns in data frames. The good news is that columns in data frames are very similar to `numpy` arrays. This means that they support a wide variety of efficient vectorized operations.

In [1]:
# standard imports
import numpy as np
import pandas as pd
# read data from csv into a pandas data frame
penguins = pd.read_csv("palmer_penguins.csv")

In these notes, we will restrict attention to five columns: Species, Region, Island, Culmen Length (mm), and Culmen Depth (mm).

In [2]:
cols = ["Species", "Region", "Island", "Culmen Length (mm)", "Culmen Depth (mm)"]
penguins = penguins[cols]



Let's take a look at the first five rows.

In [3]:
penguins.head()

|   | Species | Region | Island | Culmen Length (mm) | Culmen Depth (mm) |
|---|---|---|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.1 | 18.7 |
| 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.5 | 17.4 |
| 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 40.3 | 18.0 |
| 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | NaN | NaN |
| 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 36.7 | 19.3 |


We can add new columns in much the same way that we add entries to a dictionary. Let's start by adding the Culmen Length in centimeters.

In [4]:
penguins["Culmen Length (cm)"] = penguins["Culmen Length (mm)"] / 10
penguins.head()

|   | Species | Region | Island | Culmen Length (mm) | Culmen Depth (mm) | Culmen Length (cm) |
|---|---|---|---|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.1 | 18.7 | 3.91 |
| 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.5 | 17.4 | 3.95 |
| 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 40.3 | 18.0 | 4.03 |
| 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | NaN | NaN | NaN |
| 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 36.7 | 19.3 | 3.67 |
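
The dictionary analogy can be tried on a tiny hand-made data frame. The sketch below is only an illustration: `toy` is a hypothetical name, and its values are copied from the rows shown above plus one missing entry. Assigning to a new key appends a column, while assigning to an existing key overwrites it.

```python
import numpy as np
import pandas as pd

# toy data frame: a few culmen lengths from the notes, plus one missing value
toy = pd.DataFrame({"Culmen Length (mm)": [39.1, 39.5, np.nan, 36.7]})

# a new key appends a column; the division is applied to every row at once
toy["Culmen Length (cm)"] = toy["Culmen Length (mm)"] / 10

# an existing key overwrites that column in place
toy["Culmen Length (cm)"] = toy["Culmen Length (cm)"].round(1)

print(toy)
```

The missing measurement simply propagates: the new column is also missing in that row.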


We can also add boolean columns. For example, we can add a column tracking whether the Culmen Length is greater than 40 mm. Note that a row with a missing measurement compares as `False`.

In [5]:
penguins["Long Culmen?"] = penguins["Culmen Length (mm)"] > 40
penguins.head()

|   | Species | Region | Island | Culmen Length (mm) | Culmen Depth (mm) | Culmen Length (cm) | Long Culmen? |
|---|---|---|---|---|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.1 | 18.7 | 3.91 | False |
| 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 39.5 | 17.4 | 3.95 | False |
| 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 40.3 | 18.0 | 4.03 | True |
| 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | NaN | NaN | NaN | False |
| 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | 36.7 | 19.3 | 3.67 | False |
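
Row 3 has no culmen measurement, yet its flag reads `False`, because comparing a missing value with a number evaluates to `False`. If we would rather keep such rows marked as missing, one possible approach (an assumption about what is wanted, not something the notes require) is to mask the flag wherever the measurement is absent. The names `toy`, `measured`, and `"Long Culmen (or missing)"` are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Culmen Length (mm)": [39.1, 40.3, np.nan, 36.7]})

# comparing NaN with a number gives False, so the missing row looks "short"
toy["Long Culmen?"] = toy["Culmen Length (mm)"] > 40

# keep the flag only where a measurement exists; other rows become missing
measured = toy["Culmen Length (mm)"].notna()
toy["Long Culmen (or missing)"] = toy["Long Culmen?"].where(measured)

print(toy)
```

The masked column is no longer purely boolean, since it has to hold a missing marker alongside `True` and `False`.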


## Text Columns

If you remember your operations from `numpy`, you are already good to go on operations involving numerical columns. As we saw above, you can essentially treat `pandas` columns as `np.array`s.
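
As a brief sketch of what this looks like in practice, the example below uses a hand-made data frame with values taken from the first rows of the penguins data. The column names `"Culmen Ratio"` and `"Log Culmen Length"` are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Culmen Length (mm)": [39.1, 39.5, 40.3],
    "Culmen Depth (mm)": [18.7, 17.4, 18.0],
})

# arithmetic between two columns is elementwise, as with numpy arrays
toy["Culmen Ratio"] = toy["Culmen Length (mm)"] / toy["Culmen Depth (mm)"]

# numpy functions accept a column and give back a column with the same index
toy["Log Culmen Length"] = np.log(toy["Culmen Length (mm)"])

# reductions such as the mean collapse a column to a single number
mean_length = toy["Culmen Length (mm)"].mean()

print(toy)
print(mean_length)
```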

`pandas` also enables flexible, vectorized processing of text data. The required methods are similar to what you would use for simple strings, with the additional wrinkle that these methods are hiding behind a special `str` attribute. Let's look at some examples, with an eventual goal of cleaning up the very long species names of the penguins. 

In [6]:
#let's look at the species column
penguins["Species"]

0      Adelie Penguin (Pygoscelis adeliae)
1      Adelie Penguin (Pygoscelis adeliae)
2      Adelie Penguin (Pygoscelis adeliae)
3      Adelie Penguin (Pygoscelis adeliae)
4      Adelie Penguin (Pygoscelis adeliae)
                      ...                 
339      Gentoo penguin (Pygoscelis papua)
340      Gentoo penguin (Pygoscelis papua)
341      Gentoo penguin (Pygoscelis papua)
342      Gentoo penguin (Pygoscelis papua)
343      Gentoo penguin (Pygoscelis papua)
Name: Species, Length: 344, dtype: object

We can apply methods of the string class to the Species column, but we need an extra `.str`.

In [8]:
#All caps with the upper method
penguins["Species"].str.upper().head()

0    ADELIE PENGUIN (PYGOSCELIS ADELIAE)
1    ADELIE PENGUIN (PYGOSCELIS ADELIAE)
2    ADELIE PENGUIN (PYGOSCELIS ADELIAE)
3    ADELIE PENGUIN (PYGOSCELIS ADELIAE)
4    ADELIE PENGUIN (PYGOSCELIS ADELIAE)
Name: Species, dtype: object

In [9]:
#first letter with .get(0)
penguins["Species"].str.get(0)

0      A
1      A
2      A
3      A
4      A
      ..
339    G
340    G
341    G
342    G
343    G
Name: Species, Length: 344, dtype: object

To get the first word of the species name, we first split each entry into words and then get the first one.

In [10]:
#split turns each row into a list of strings
penguins["Species"].str.split()

0      [Adelie, Penguin, (Pygoscelis, adeliae)]
1      [Adelie, Penguin, (Pygoscelis, adeliae)]
2      [Adelie, Penguin, (Pygoscelis, adeliae)]
3      [Adelie, Penguin, (Pygoscelis, adeliae)]
4      [Adelie, Penguin, (Pygoscelis, adeliae)]
                         ...                   
339      [Gentoo, penguin, (Pygoscelis, papua)]
340      [Gentoo, penguin, (Pygoscelis, papua)]
341      [Gentoo, penguin, (Pygoscelis, papua)]
342      [Gentoo, penguin, (Pygoscelis, papua)]
343      [Gentoo, penguin, (Pygoscelis, papua)]
Name: Species, Length: 344, dtype: object

In [11]:
#get the first entry
penguins["Species"].str.split().str.get(0)

0      Adelie
1      Adelie
2      Adelie
3      Adelie
4      Adelie
        ...  
339    Gentoo
340    Gentoo
341    Gentoo
342    Gentoo
343    Gentoo
Name: Species, Length: 344, dtype: object

In [12]:
#overwrite the Species column with the shortened name
penguins["Species"] = penguins["Species"].str.split().str.get(0)

In [13]:
#look at our new data frame
penguins.head()

|   | Species | Region | Island | Culmen Length (mm) | Culmen Depth (mm) | Culmen Length (cm) | Long Culmen? |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Anvers | Torgersen | 39.1 | 18.7 | 3.91 | False |
| 1 | Adelie | Anvers | Torgersen | 39.5 | 17.4 | 3.95 | False |
| 2 | Adelie | Anvers | Torgersen | 40.3 | 18.0 | 4.03 | True |
| 3 | Adelie | Anvers | Torgersen | NaN | NaN | NaN | False |
| 4 | Adelie | Anvers | Torgersen | 36.7 | 19.3 | 3.67 | False |
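
The same `.str` accessor offers a few other routes to similar results. The sketch below is not part of the original cleaning steps. It works on a hypothetical two-entry series named `species`, and shows the indexing shorthand `.str[0]` together with `.str.extract`, which here pulls out the scientific name between the parentheses.

```python
import pandas as pd

species = pd.Series([
    "Adelie Penguin (Pygoscelis adeliae)",
    "Gentoo penguin (Pygoscelis papua)",
])

# .str[0] is indexing shorthand that behaves like .str.get(0)
common = species.str.split().str[0]

# capture the text between the parentheses with a regular expression;
# expand=False returns a Series because there is a single capture group
scientific = species.str.extract(r"\((.*)\)", expand=False)

print(common.tolist())
print(scientific.tolist())
```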


## Externally setting columns

New columns don't have to be computed from existing columns. Here is an example that doesn't make sense in the context of penguins, but might in other contexts where each row is a different day.

In [14]:
#add row numbers
penguins["row number"] = range(len(penguins))

In [15]:
#reduce modulo seven
penguins["row number"] %= 7

In [16]:
penguins.loc[1:10]

|   | Species | Region | Island | Culmen Length (mm) | Culmen Depth (mm) | Culmen Length (cm) | Long Culmen? | row number |
|---|---|---|---|---|---|---|---|---|
| 1 | Adelie | Anvers | Torgersen | 39.5 | 17.4 | 3.95 | False | 1 |
| 2 | Adelie | Anvers | Torgersen | 40.3 | 18.0 | 4.03 | True | 2 |
| 3 | Adelie | Anvers | Torgersen | NaN | NaN | NaN | False | 3 |
| 4 | Adelie | Anvers | Torgersen | 36.7 | 19.3 | 3.67 | False | 4 |
| 5 | Adelie | Anvers | Torgersen | 39.3 | 20.6 | 3.93 | False | 5 |
| 6 | Adelie | Anvers | Torgersen | 38.9 | 17.8 | 3.89 | False | 6 |
| 7 | Adelie | Anvers | Torgersen | 39.2 | 19.6 | 3.92 | False | 0 |
| 8 | Adelie | Anvers | Torgersen | 34.1 | 18.1 | 3.41 | False | 1 |
| 9 | Adelie | Anvers | Torgersen | 42.0 | 20.2 | 4.2 | True | 2 |
| 10 | Adelie | Anvers | Torgersen | 37.8 | 17.1 | 3.78 | False | 3 |


With other datasets, this could be used to keep track of the day of the week.
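
To make the day-of-the-week idea concrete, here is a minimal sketch on made-up data. The data frame `days`, its `observation` column, the `day_names` array, and the assumption that the first row falls on a Monday are all hypothetical. The row number modulo seven is used as an index into an array of day names.

```python
import numpy as np
import pandas as pd

# hypothetical data: ten consecutive daily observations, the first on a Monday
days = pd.DataFrame({"observation": np.arange(10) * 1.5})

day_names = np.array(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])

# the row number modulo seven cycles through 0, 1, ..., 6
days["row number"] = range(len(days))
days["row number"] %= 7

# use that cycle as an index into the array of day names
days["day"] = day_names[days["row number"].to_numpy()]

print(days)
```

When a dataset carries real dates, a datetime column would usually be a more reliable source for the weekday than row position, since the modulo trick assumes exactly one row per day with no gaps.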