# Creating and Modifying Columns

- [Download the lecture notes](https://philchodrow.github.io/PIC16A/content/pd/pd_3.ipynb). 

In this set of notes, we'll learn how to operate on and create new columns in data frames. The good news is that columns in data frames are very similar to `numpy` arrays. This means that they support a wide variety of efficient vectorized operations. 

In [1]:
import pandas as pd
import numpy as np

In [2]:
penguins = pd.read_csv("palmer_penguins.csv")

cols = ["Species", "Region", "Island", "Culmen Length (mm)", "Culmen Depth (mm)"]

penguins = penguins[cols]
penguins.head()

Unnamed: 0,Species,Region,Island,Culmen Length (mm),Culmen Depth (mm)
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1,18.7
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5,17.4
2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,40.3,18.0
3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,,
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7,19.3


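The easiest way to add a new column is to assign to `df["column name"]`, where `"column name"` is the name of the new column. The right-hand side can be any vectorized expression built from existing columns. 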

In [3]:
penguins["Culmen Length (cm)"] = penguins["Culmen Length (mm)"] / 10
penguins.head()

Unnamed: 0,Species,Region,Island,Culmen Length (mm),Culmen Depth (mm),Culmen Length (cm)
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1,18.7,3.91
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5,17.4,3.95
2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,40.3,18.0,4.03
3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,,,
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7,19.3,3.67


In [4]:
# comparisons are vectorized too: True in each row where Region equals "Anvers"
penguins["Anvers"] = penguins["Region"] == "Anvers"
penguins.head()

Unnamed: 0,Species,Region,Island,Culmen Length (mm),Culmen Depth (mm),Culmen Length (cm),Anvers
0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.1,18.7,3.91,True
1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,39.5,17.4,3.95,True
2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,40.3,18.0,4.03,True
3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,,,,True
4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,36.7,19.3,3.67,True
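
The two cells above each build a column from a single existing column. As a minimal sketch of the same idea with two columns, the block below uses a small stand-in frame whose values are taken from rows shown above. The names `toy`, `Length/Depth`, and `Long Culmen` and the threshold of 38 are illustrative assumptions, not part of the penguins data.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the NaN row mimics the penguin with missing measurements.
toy = pd.DataFrame({
    "Region": ["Anvers", "Anvers", "Anvers"],
    "Culmen Length (mm)": [39.1, np.nan, 36.7],
    "Culmen Depth (mm)": [18.7, np.nan, 19.3],
})

# Elementwise arithmetic between two columns; missing values stay missing.
toy["Length/Depth"] = toy["Culmen Length (mm)"] / toy["Culmen Depth (mm)"]

# Elementwise comparison gives a boolean column; comparing NaN yields False.
toy["Long Culmen"] = toy["Culmen Length (mm)"] > 38

print(toy)
```

Note that the missing measurements produce `NaN` in the ratio column but `False` in the comparison column, so it is worth checking for missing data before interpreting a boolean column.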


## Text Columns

If you remember your operations from `numpy`, you are already good to go on operations involving numerical columns. As we saw above, you can create and manipulate `pandas` columns in much the same way as `np.array`s. 
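
As a quick sketch of what carries over from `numpy`, the block below applies a `numpy` function and a reduction to a standalone column. The `lengths` series is a hypothetical stand-in that reuses the first five culmen lengths shown earlier.

```python
import numpy as np
import pandas as pd

# Stand-in column with one missing value, as in the penguins data.
lengths = pd.Series([39.1, 39.5, 40.3, np.nan, 36.7], name="Culmen Length (mm)")

# numpy functions apply elementwise and hand back a pandas Series.
log_lengths = np.log(lengths)

# Scalar arithmetic works as it does for arrays. One difference from numpy:
# pandas reductions such as .mean() skip missing values by default.
centered = lengths - lengths.mean()

print(log_lengths)
print(centered)
```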

`pandas` also enables flexible, vectorized processing of text data. The required methods are similar to what you would use for simple strings, with the additional wrinkle that these methods are hiding behind a special `str` attribute. Let's look at some examples, with an eventual goal of cleaning up the very long species names of the penguins. 

In [5]:
penguins["Species"]

0      Adelie Penguin (Pygoscelis adeliae)
1      Adelie Penguin (Pygoscelis adeliae)
2      Adelie Penguin (Pygoscelis adeliae)
3      Adelie Penguin (Pygoscelis adeliae)
4      Adelie Penguin (Pygoscelis adeliae)
                      ...                 
339      Gentoo penguin (Pygoscelis papua)
340      Gentoo penguin (Pygoscelis papua)
341      Gentoo penguin (Pygoscelis papua)
342      Gentoo penguin (Pygoscelis papua)
343      Gentoo penguin (Pygoscelis papua)
Name: Species, Length: 344, dtype: object

In [6]:
# "get" the first letter -- vectorized version of "my_string"[0]
penguins["Species"].str.get(0)

0      A
1      A
2      A
3      A
4      A
      ..
339    G
340    G
341    G
342    G
343    G
Name: Species, Length: 344, dtype: object

In [7]:
penguins["Species"].str.upper()

0      ADELIE PENGUIN (PYGOSCELIS ADELIAE)
1      ADELIE PENGUIN (PYGOSCELIS ADELIAE)
2      ADELIE PENGUIN (PYGOSCELIS ADELIAE)
3      ADELIE PENGUIN (PYGOSCELIS ADELIAE)
4      ADELIE PENGUIN (PYGOSCELIS ADELIAE)
                      ...                 
339      GENTOO PENGUIN (PYGOSCELIS PAPUA)
340      GENTOO PENGUIN (PYGOSCELIS PAPUA)
341      GENTOO PENGUIN (PYGOSCELIS PAPUA)
342      GENTOO PENGUIN (PYGOSCELIS PAPUA)
343      GENTOO PENGUIN (PYGOSCELIS PAPUA)
Name: Species, Length: 344, dtype: object

In [8]:
penguins["Species"].str.split()

0      [Adelie, Penguin, (Pygoscelis, adeliae)]
1      [Adelie, Penguin, (Pygoscelis, adeliae)]
2      [Adelie, Penguin, (Pygoscelis, adeliae)]
3      [Adelie, Penguin, (Pygoscelis, adeliae)]
4      [Adelie, Penguin, (Pygoscelis, adeliae)]
                         ...                   
339      [Gentoo, penguin, (Pygoscelis, papua)]
340      [Gentoo, penguin, (Pygoscelis, papua)]
341      [Gentoo, penguin, (Pygoscelis, papua)]
342      [Gentoo, penguin, (Pygoscelis, papua)]
343      [Gentoo, penguin, (Pygoscelis, papua)]
Name: Species, Length: 344, dtype: object

In [9]:
# operations can be chained, but each step needs its own .str accessor
penguins["Species"].str.split().str.get(0)

0      Adelie
1      Adelie
2      Adelie
3      Adelie
4      Adelie
        ...  
339    Gentoo
340    Gentoo
341    Gentoo
342    Gentoo
343    Gentoo
Name: Species, Length: 344, dtype: object

In [10]:
penguins["Species"] = penguins["Species"].str.split().str.get(0)
penguins.head()

Unnamed: 0,Species,Region,Island,Culmen Length (mm),Culmen Depth (mm),Culmen Length (cm),Anvers
0,Adelie,Anvers,Torgersen,39.1,18.7,3.91,True
1,Adelie,Anvers,Torgersen,39.5,17.4,3.95,True
2,Adelie,Anvers,Torgersen,40.3,18.0,4.03,True
3,Adelie,Anvers,Torgersen,,,,True
4,Adelie,Anvers,Torgersen,36.7,19.3,3.67,True
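
The same chaining pattern can pull out other pieces of the original names. Here is a minimal sketch, assuming we also wanted the scientific name that appears in parentheses. The `species` series is a two-row stand-in for the original, uncleaned column, and `common` and `scientific` are illustrative names.

```python
import pandas as pd

# Two-row stand-in for the original Species column.
species = pd.Series([
    "Adelie Penguin (Pygoscelis adeliae)",
    "Gentoo penguin (Pygoscelis papua)",
])

# Plain Python: string methods are called directly on a single string.
first_word = species[0].split()[0]

# pandas: the same steps, applied to every entry through .str
common = species.str.split().str.get(0)

# Text inside the parentheses: split on "(", keep the second piece,
# then strip the trailing ")".
scientific = species.str.split("(").str.get(1).str.strip(")")

print(common)
print(scientific)
```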


## Externally Setting Columns

It is also possible to add columns without referencing any objects in the data frame. This is not usually recommended, but may be helpful on certain occasions. There are two approaches. The first is to add a column using a list or similar collection of length equal to the number of rows in the data frame. 

In [11]:
len(penguins)

344

In [12]:
penguins["counter"] = range(len(penguins))

In [13]:
penguins.head()

Unnamed: 0,Species,Region,Island,Culmen Length (mm),Culmen Depth (mm),Culmen Length (cm),Anvers,counter
0,Adelie,Anvers,Torgersen,39.1,18.7,3.91,True,0
1,Adelie,Anvers,Torgersen,39.5,17.4,3.95,True,1
2,Adelie,Anvers,Torgersen,40.3,18.0,4.03,True,2
3,Adelie,Anvers,Torgersen,,,,True,3
4,Adelie,Anvers,Torgersen,36.7,19.3,3.67,True,4
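
The length requirement is strict. As a small sketch on a made-up three-row frame (the `toy` frame and the `too short` column name are hypothetical), a collection with one entry per row is accepted, while a list of the wrong length raises a `ValueError`.

```python
import pandas as pd

toy = pd.DataFrame({"Species": ["Adelie", "Adelie", "Gentoo"]})

# One entry per row: accepted.
toy["counter"] = range(len(toy))

# Two entries for three rows: pandas refuses the assignment.
try:
    toy["too short"] = [0, 1]
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

print(mismatch_error)
```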


The second approach is to use a single value, such as a string or `int`. In this case, all rows will have the same value in the corresponding column. 

In [14]:
penguins["picard"] = "picard"

In [15]:
penguins

Unnamed: 0,Species,Region,Island,Culmen Length (mm),Culmen Depth (mm),Culmen Length (cm),Anvers,counter,picard
0,Adelie,Anvers,Torgersen,39.1,18.7,3.91,True,0,picard
1,Adelie,Anvers,Torgersen,39.5,17.4,3.95,True,1,picard
2,Adelie,Anvers,Torgersen,40.3,18.0,4.03,True,2,picard
3,Adelie,Anvers,Torgersen,,,,True,3,picard
4,Adelie,Anvers,Torgersen,36.7,19.3,3.67,True,4,picard
...,...,...,...,...,...,...,...,...,...
339,Gentoo,Anvers,Biscoe,,,,True,339,picard
340,Gentoo,Anvers,Biscoe,46.8,14.3,4.68,True,340,picard
341,Gentoo,Anvers,Biscoe,50.4,15.7,5.04,True,341,picard
342,Gentoo,Anvers,Biscoe,45.2,14.8,4.52,True,342,picard
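
The same behavior holds for other single values, not only strings. A minimal sketch on a made-up three-row frame, where the `toy` frame and the column names `zero` and `source` are illustrative assumptions:

```python
import pandas as pd

toy = pd.DataFrame({"Species": ["Adelie", "Adelie", "Gentoo"]})

# A single value is repeated in every row, whatever its type.
toy["zero"] = 0
toy["source"] = "toy data"

print(toy)
```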
