# Advanced Applications of Mutate

## Map and apply

In [1]:
import numpy as np  # np.nan is used in a later cell
import pandas as pd
from dfply import *
import matplotlib.pylab as plt
%matplotlib inline

## Hiding stack traceback

We hide the exception traceback for didactic reasons (code source: [see this post](https://stackoverflow.com/questions/46222753/how-do-i-suppress-tracebacks-in-jupyter)).  Don't run this cell if you want to see a full traceback.

In [2]:
import sys
ipython = get_ipython()

def hide_traceback(exc_tuple=None, filename=None, tb_offset=None,
                   exception_only=False, running_compiled_code=False):
    etype, value, tb = sys.exc_info()
    return ipython._showtraceback(etype, value, ipython.InteractiveTB.get_exception_only(etype, value))

ipython.showtraceback = hide_traceback

## Data set

We will be using two of the data sets provided by the Museum of Modern Art (MoMA) in this lecture.  Make sure that you have downloaded each repository.  [Download Instructions](./get_MOMA_data.ipynb)

#### MoMA Artists

In [3]:
artists = pd.read_csv("./data/Artists.csv")
artists.head(2)

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,


# Transforming columns with the `map` and `apply` methods

Next, we will take a look at two useful `pandas Series` methods that allow us to apply very general transformations: `map` and `apply`.

## Transforming a column with `map`

`df.col.map` can be used to

* Apply a translation `dict`
* Apply a function
* Apply a `pd.Series`
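
The third case, mapping with a `pd.Series`, can be sketched as follows. This is a minimal illustration in plain `pandas`; the `lookup` and `gender` series are hypothetical stand-ins, not the MoMA data. The index of the lookup series plays the role of the `dict` keys.

```python
import pandas as pd

# Hypothetical lookup table: the index holds the old labels, the values the new ones
lookup = pd.Series({'Male': 'm', 'Female': 'f'})

# Hypothetical column to translate
gender = pd.Series(['Male', 'Female', 'Male', 'Unknown'])

# Each value in `gender` is looked up in the index of `lookup`;
# labels that are not in the index become NaN
mapped = gender.map(lookup)
print(mapped.tolist())
```

Labels missing from the lookup index come back as `NaN`, just as they would with a plain `dict`.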

In [4]:
artists.Gender.value_counts()

Male          9762
Female        2300
male            15
Non-Binary       2
female           1
Non-binary       1
Name: Gender, dtype: int64

#### `map`ping a translation `dict`

In [5]:
new_gender = {'Male':'m', 'Female':'f', 'male':'m', 'female':'f', 'Non-Binary':'nb', 'Non-binary':'nb'}
(artists
 >> select(X.Gender)
 >> mutate(new_gender = X.Gender.map(new_gender))
 >> head(9)
)

Unnamed: 0,Gender,new_gender
0,Male,m
1,Male,m
2,Male,m
3,Male,m
4,Male,m
5,Male,m
6,Male,m
7,Male,m
8,Female,f


#### Setting a default with `collections.defaultdict`

In [6]:
from collections import defaultdict

from_america = defaultdict(lambda: 'Not America')
from_america.update({'American':'America'})

#### Applying the `defaultdict`

In [7]:
(artists
 >> select(X.Nationality)
 >> mutate(from_america = X.Nationality.map(from_america))
 >> head(3)
)

Unnamed: 0,Nationality,from_america
0,American,America
1,Spanish,Not America
2,American,America
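
To see why the `defaultdict` matters, it may help to compare it with a plain `dict` on a small stand-in column. This is a sketch in plain `pandas`; the `nationality` series is made up for illustration.

```python
from collections import defaultdict
import pandas as pd

# Hypothetical stand-in for artists.Nationality
nationality = pd.Series(['American', 'Spanish', 'American'])

# A plain dict has no entry for 'Spanish', so that row becomes NaN
plain = nationality.map({'American': 'America'})

# A defaultdict supplies its default value for any key it does not contain
lookup = defaultdict(lambda: 'Not America', {'American': 'America'})
with_default = nationality.map(lookup)

print(plain.tolist())
print(with_default.tolist())
```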


#### `map`ping a simple function

In [8]:
(artists
 >> select(X.Nationality)
 >> mutate(from_USA = X.Nationality.map(lambda n: 'USA' if n == 'American' else 'Other'))
 >> head(3)
)

Unnamed: 0,Nationality,from_USA
0,American,USA
1,Spanish,Other
2,American,USA


## Be sure to `apply` yourself!

* `df.col.apply` is used to apply any function to a column.
    * Including positional and keyword arguments
* Could literally be used to perform *any* mutation

#### Applying a unary function

In [9]:
century = lambda year_string: (int(year_string)//100)*100

(artists
 >> select(X.BeginDate)
 >> mutate(century_of_birth = X.BeginDate.apply(century))
 >> head(3)
)

Unnamed: 0,BeginDate,century_of_birth
0,1930,1900
1,1936,1900
2,1941,1900


## Using anonymous functions

* There is no need to name a `lambda`
* An embedded `lambda` is called an **anonymous function**

In [10]:
(artists
 >> select(X.EndDate)
 >> mutate(new_end_date = (X.EndDate
                           .apply(lambda y: y if int(y) > 0 else np.nan)
                           .astype('Int64')))
 >> head(2)
)

Unnamed: 0,EndDate,new_end_date
0,1992,1992.0
1,0,
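
For a simple condition like this one, a vectorized alternative to the `lambda` is `Series.where`. The sketch below is not the lecture's approach; it uses plain `pandas` and a made-up `end_date` series standing in for `artists.EndDate`.

```python
import pandas as pd

# Hypothetical stand-in for artists.EndDate, where 0 marks a missing year
end_date = pd.Series([1992, 0, 1985])

# Keep values where the condition holds and replace the rest with a missing value,
# then convert to the nullable integer dtype so the years stay whole numbers
new_end_date = end_date.where(end_date > 0).astype('Int64')
print(new_end_date)
```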


## `apply` or `map`

* Use `map` for simple functions
* Use `apply` when adding additional arguments
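
The difference can be sketched on a small, made-up `heights` series (an assumption for illustration, not the MoMA data). With `apply`, the extra argument is passed through `args`; with `map`, it has to be bound to the function beforehand, for example with a `lambda` or `functools.partial`.

```python
from functools import partial
import pandas as pd

# Hypothetical stand-in for artwork.Height_cm
heights = pd.Series([48.6, 40.6401, 34.3])

# `apply` forwards extra positional arguments through `args`
via_apply = heights.apply(round, args=(1,))

# With `map`, the extra argument is bound before the function is handed over
via_lambda = heights.map(lambda h: round(h, 1))
via_partial = heights.map(partial(round, ndigits=1))

print(via_apply.tolist())
```

All three produce the same rounded values; `apply` simply saves the wrapping step.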

#### MoMA Artwork

In [11]:
from more_dfply import fix_names

artwork = (pd.read_csv("./data/Artworks.csv")
           >> fix_names
           >> mutate(id = X.index + 1)
          )
artwork.head(2)

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,Circumference_cm,Depth_cm,Diameter_cm,Height_cm,Length_cm,Weight_kg,Width_cm,Seat_Height_cm,Duration_sec,id
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,,,,48.6,,,168.9,,,1
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,,,,40.6401,,,29.8451,,,2


#### Setting a positional argument

We want to apply `round(val, 1)`


In [12]:
(artwork
 >> select(X.Height_cm)
 >> mutate(rounded_height = X.Height_cm.apply(round, args=(1,)))
 >> head(3)
)

Unnamed: 0,Height_cm,rounded_height
0,48.6,48.6
1,40.6401,40.6
2,34.3,34.3


#### Setting a keyword argument

We want to apply `log1p(val, base=n)`

In [13]:
from math import log, e

log1p = lambda num, base=e: log(num + 1, base)
(artwork
 >> select(X.Height_cm)
 >> mutate(log10_plus_1 = X.Height_cm.apply(log1p, base = 10),
           log2_plus_1 = X.Height_cm.apply(log1p, base = 2),
           ln_plus_1 = X.Height_cm.apply(log1p, base = e))
 >> head(3)
)

Unnamed: 0,Height_cm,log10_plus_1,log2_plus_1,ln_plus_1
0,48.6,1.695482,5.632268,3.903991
1,40.6401,1.619512,5.379902,3.729064
2,34.3,1.547775,5.141596,3.563883


## <font color="red"> Exercise 2 </font>

An **indicator column** for a category contains 1 for the rows that match that label and 0 otherwise.  Use the `exhibitions` dataframe to complete the following tasks.

1. Use `exhibitions.ExhibitionRole.unique()` to get a list of the unique categories.
2. Use `mutate` and `map` with `defaultdict` to create an indicator column for each category (ignore missing rows).
3. Comment on the quality of your solution, especially in light of the [DRY principle](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)

#### MoMA Exhibitions

In [70]:
dat_cols = ['ExhibitionBeginDate', 'ExhibitionEndDate', 'ConstituentBeginDate' ,'ConstituentEndDate']
exhibitions = pd.read_csv('./data/MoMAExhibitions1929to1989.csv', 
                          encoding="ISO-8859-1",
                          parse_dates=dat_cols)
exhibitions.head(2)
exhibitions

Unnamed: 0,ExhibitionID,ExhibitionNumber,ExhibitionTitle,ExhibitionCitationDate,ExhibitionBeginDate,ExhibitionEndDate,ExhibitionSortOrder,ExhibitionURL,ExhibitionRole,ExhibitionRoleinPressRelease,...,Institution,Nationality,ConstituentBeginDate,ConstituentEndDate,ArtistBio,Gender,VIAFID,WikidataID,ULANID,ConstituentURL
0,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",1929-11-07,1929-12-07,1.0,moma.org/calendar/exhibitions/1767,Curator,Director,...,,American,1902,1981,"American, 19021981",Male,109252853.0,Q711362,500241556.0,moma.org/artists/9168
1,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",1929-11-07,1929-12-07,1.0,moma.org/calendar/exhibitions/1767,Artist,Artist,...,,French,1839,1906,"French, 18391906",Male,39374836.0,Q35548,500004793.0,moma.org/artists/1053
2,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",1929-11-07,1929-12-07,1.0,moma.org/calendar/exhibitions/1767,Artist,Artist,...,,French,1848,1903,"French, 18481903",Male,27064953.0,Q37693,500011421.0,moma.org/artists/2098
3,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",1929-11-07,1929-12-07,1.0,moma.org/calendar/exhibitions/1767,Artist,Artist,...,,Dutch,1853,1890,"Dutch, 18531890",Male,9854560.0,Q5582,500115588.0,moma.org/artists/2206
4,2557.0,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",1929-11-07,1929-12-07,1.0,moma.org/calendar/exhibitions/1767,Artist,Artist,...,,French,1859,1891,"French, 18591891",Male,24608076.0,Q34013,500008873.0,moma.org/artists/5358
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34553,288.0,1536,Recent Japanese Posters from the Collection,"[MoMA Exh. #1536, December 9, 1989-April 16, 1...",1989-12-09,1990-04-16,1767.0,moma.org/calendar/exhibitions/1739,Artist,Artist,...,,Japanese,1942,,"Japanese, born 1942",Male,18484958.0,,,moma.org/artists/6215
34554,288.0,1536,Recent Japanese Posters from the Collection,"[MoMA Exh. #1536, December 9, 1989-April 16, 1...",1989-12-09,1990-04-16,1767.0,moma.org/calendar/exhibitions/1739,Artist,Artist,...,,Japanese,1943,,"Japanese, born 1943",Male,,,,moma.org/artists/6486
34555,288.0,1536,Recent Japanese Posters from the Collection,"[MoMA Exh. #1536, December 9, 1989-April 16, 1...",1989-12-09,1990-04-16,1767.0,moma.org/calendar/exhibitions/1739,Artist,Artist,...,,Japanese,1920,,"Japanese, born 1920",Male,119202488.0,Q2178400,,moma.org/artists/6487
34556,288.0,1536,Recent Japanese Posters from the Collection,"[MoMA Exh. #1536, December 9, 1989-April 16, 1...",1989-12-09,1990-04-16,1767.0,moma.org/calendar/exhibitions/1739,Artist,Artist,...,,Japanese,1936,,"Japanese, born 1936",Male,96086073.0,Q3513688,500060125.0,moma.org/artists/6502


In [24]:
exhibitions.ExhibitionRole.unique()

array(['Curator', 'Artist', nan, 'Arranger', 'Installer',
       'Competition Judge', 'Designer', 'Preparer'], dtype=object)

In [58]:
nan_rows = exhibitions[exhibitions['ExhibitionRole'].isnull()]

In [59]:
nan_rows

Unnamed: 0,ExhibitionID,ExhibitionNumber,ExhibitionTitle,ExhibitionCitationDate,ExhibitionBeginDate,ExhibitionEndDate,ExhibitionSortOrder,ExhibitionURL,ExhibitionRole,ExhibitionRoleinPressRelease,...,Institution,Nationality,ConstituentBeginDate,ConstituentEndDate,ArtistBio,Gender,VIAFID,WikidataID,ULANID,ConstituentURL
547,2986.0,21,Poster Competition,"[MoMA Exh. #21, February 25-March 12, 1933]",1933-02-25,1933-03-12,22.0,moma.org/calendar/exhibitions/2049,,,...,,,,,,,,,,
607,2993.0,25a,Typography Competition,"[MoMA Exh. #25a, March 27-April 6, 1933]",1933-03-27,1933-04-06,29.0,moma.org/calendar/exhibitions/2054,,,...,,,,,,,,,,
622,2995.0,26a,The Museum Collection: Painting and Sculpture,"[MoMA Exh. #26a, March 27-April 25, 1933]",1933-03-27,1933-04-25,31.0,moma.org/calendar/exhibitions/2056,,,...,,,,,,,,,,
1172,3010.0,34e,Westchester Folk Art Exhibition,"[MoMA Exh. #34e, June 23-July 9, 1934]",1934-06-23,1934-07-09,52.0,moma.org/calendar/exhibitions/2933,,,...,,,,,,,,,,
1196,3012.0,34h,The Making of a Museum Publication,"[MoMA Exh. #34h, September 11-October 7, 1934]",1934-09-11,1934-10-07,55.0,moma.org/calendar/exhibitions/2934,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33373,,No#,Architecture and Design Permanent Gallery Rein...,[Winter 1987-88],1987-12-01,NaT,1691.0,,,,...,,,,,,,,,,
33605,,1485,"Painting and Sculpture, New Reinstallation","[MoMA Exh. #1485, December 24, 1987-September ...",1987-12-24,1988-09-12,1706.0,,,,...,,,,,,,,,,
34243,,No#,Architecture and Design Permanent Collection G...,[Spring 1989],1989-04-01,NaT,1741.0,,,,...,,,,,,,,,,
34507,,No#,Architecture and Design Permanent Collection G...,[November 1989 - no closing date],1989-11-01,NaT,1764.0,,,,...,,,,,,,,,,


In [54]:
help(exhibitions.filter)

Help on method filter in module pandas.core.generic:

filter(items=None, like: 'Optional[str]' = None, regex: 'Optional[str]' = None, axis=None) -> 'FrameOrSeries' method of pandas.core.frame.DataFrame instance
    Subset the dataframe rows or columns according to the specified index labels.
    
    Note that this routine does not filter a dataframe on its
    contents. The filter is applied to the labels of the index.
    
    Parameters
    ----------
    items : list-like
        Keep labels from axis which are in items.
    like : str
        Keep labels from axis for which "like in label == True".
    regex : str (regular expression)
        Keep labels from axis for which re.search(regex, label) == True.
    axis : {0 or ‘index’, 1 or ‘columns’, None}, default None
        The axis to filter on, expressed either as an index (int)
        or axis name (str). By default this is the info axis,
        'index' for Series, 'columns' for DataFrame.
    
    Returns
    -------
    sam

In [67]:
# Your code here

is_Curator = defaultdict(lambda: 0)
is_Curator.update({'Curator':1})

is_Artist = defaultdict(lambda: 0)
is_Artist.update({'Artist':1})

is_Arranger = defaultdict(lambda: 0)
is_Arranger.update({'Arranger':1})

is_Installer = defaultdict(lambda: 0)
is_Installer.update({'Installer':1})

is_Competition_Judge = defaultdict(lambda: 0)
is_Competition_Judge.update({'Competition Judge':1})

is_Designer = defaultdict(lambda: 0)
is_Designer.update({'Designer':1})

is_Preparer = defaultdict(lambda: 0)
is_Preparer.update({'Preparer':1})

In [68]:
Exibitions_Indicators = (exhibitions
 >> select(X.ExhibitionRole)
 >> mutate(Curator = X.ExhibitionRole.map(is_Curator))
 >> mutate(Artist = X.ExhibitionRole.map(is_Artist))
 >> mutate(Arranger = X.ExhibitionRole.map(is_Arranger))
 >> mutate(Installer = X.ExhibitionRole.map(is_Installer))
 >> mutate(Competition_Judge = X.ExhibitionRole.map(is_Competition_Judge))
 >> mutate(Designer = X.ExhibitionRole.map(is_Designer))
 >> mutate(Preparer = X.ExhibitionRole.map(is_Preparer))
 >> head(200)
) 

In [69]:
Exibitions_Indicators

Unnamed: 0,ExhibitionRole,Curator,Artist,Arranger,Installer,Competition_Judge,Designer,Preparer
0,Curator,1,0,0,0,0,0,0
1,Artist,0,1,0,0,0,0,0
2,Artist,0,1,0,0,0,0,0
3,Artist,0,1,0,0,0,0,0
4,Artist,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...
195,Artist,0,1,0,0,0,0,0
196,Artist,0,1,0,0,0,0,0
197,Artist,0,1,0,0,0,0,0
198,Artist,0,1,0,0,0,0,0


The table above has an indicator column for each role and ignores rows with NaN values.

I do not think this is a DRY solution as laid out in the linked article, because this structure says "no" too many times to fit that bill. We could drop the source column and remove that redundancy, but we would still have redundant "no" values. The base column states what each person is directly: once, completely, and definitively. In the indicator solution, we not only say that a person is a curator, we also say that they are not an artist, arranger, judge, and so on. To me, this looks much more like a WET (Write Every Time) approach. That is not to say that dummy variables are inherently bad, since certain analytical methods require the data in this form. But as efficient storage, or as "clean" data, it seems far less effective.
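
On the code side of the DRY question, the seven near-identical `defaultdict` definitions and `mutate` calls could be generated in a loop instead of written out by hand. The following is a minimal sketch in plain `pandas`, not the exercise's reference solution. The `roles` series is a small made-up stand-in for `exhibitions.ExhibitionRole`, and the `indicator` helper is hypothetical.

```python
from collections import defaultdict
import pandas as pd

# Small stand-in for exhibitions.ExhibitionRole (illustrative values only)
roles = pd.Series(['Curator', 'Artist', None, 'Competition Judge'],
                  name='ExhibitionRole')

def indicator(label):
    """Return a lookup that maps `label` to 1 and everything else to 0."""
    return defaultdict(lambda: 0, {label: 1})

# One column per non-missing role; spaces become underscores in the column names
labels = roles.dropna().unique()
indicators = roles.to_frame().assign(
    **{label.replace(' ', '_'): roles.map(indicator(label)) for label in labels}
)
print(indicators)
```

Adding or removing a role then requires no change to the code, since the column list is derived from the data.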