# MoMA Data Cleaning and Analysis

Data about the art in the Museum of Modern Art (MoMA)

## Data headers
Column      | Index | Description
----------- | --- | -----------
Title | 0 | the title of the artwork
Artist | 1 | the name of the artist who created the artwork
Nationality | 2 | the nationality of the artist
BeginDate | 3 | the year in which the artist was born
EndDate | 4 | the year in which the artist died
Gender | 5 | the gender of the artist
Date | 6 | the date that the artwork was created
Department | 7 | the department inside MoMA to which the artwork belongs
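
The indexes in the table can also be captured in code, so that a column can be looked up by name instead of by bare number. This is only a sketch, assuming the header order listed above; `COLUMNS` and `COL` are illustrative names that are not used elsewhere in this notebook.

```python
# Column names in the order listed in the table above.
COLUMNS = ['Title', 'Artist', 'Nationality', 'BeginDate',
           'EndDate', 'Gender', 'Date', 'Department']

# Hypothetical helper: map each column name to its list index.
COL = {name: index for index, name in enumerate(COLUMNS)}

# A sample row in the raw format of the data file.
sample_row = ['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', '(American)',
              '(1947)', '(2013)', '(Female)', '1986', 'Prints & Illustrated Books']

print(sample_row[COL['Gender']])  # same element as sample_row[5]
```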

## Read data
Open the data file and read its contents into a list of lists

Print the header row, then remove it from the data

In [23]:
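from csv import reader

# Read every row into a list of lists; the with block closes the file afterwards
with open('artworks.csv', encoding='utf-8', newline='') as opened_file:
    moma = list(reader(opened_file))

print(moma[0])   # header row
moma = moma[1:]  # keep only the data rows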

['Title', 'Artist', 'Nationality', 'BeginDate', 'EndDate', 'Gender', 'Date', 'Department']


Print the first few rows of the data

In [24]:
def print_rows(dataset, num_rows):
    # Print the first num_rows rows, each followed by a blank line
    for row in dataset[:num_rows]:
        print(row, '\n')

print_rows(moma, 3)

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', '(American)', '(1947)', '(2013)', '(Female)', '1986', 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', '(Spanish)', '(1916)', '(2007)', '(Male)', '1978', 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', '(French)', '(1870)', '(1943)', '(Male)', '1889-1911', 'Prints & Illustrated Books'] 



## Clean data
Remove the parentheses from the Nationality [2], BeginDate [3], EndDate [4] and Gender [5] columns

Rather than rewriting the code for each column as instructed in the DQ exercise, I've created a function that can be reused for each column we want to clean

In [25]:
def remove_parens(dataset, index):
    # Strip '(' and ')' from one column of every row, modifying the rows in place.
    # Nothing is returned, so the notebook does not echo the dataset after the last call.
    for row in dataset:
        string = row[index]
        string = string.replace('(', '')
        string = string.replace(')', '')
        row[index] = string

# Nationality, BeginDate, EndDate, Gender
for index in (2, 3, 4, 5):
    remove_parens(moma, index)




Print some data to validate that parens have been removed

In [26]:
print_rows(moma, 3)

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', 'American', '1947', '2013', 'Female', '1986', 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', 'Spanish', '1916', '2007', 'Male', '1978', 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', '1870', '1943', 'Male', '1889-1911', 'Prints & Illustrated Books'] 
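
Printing three rows only checks those three rows. As a sketch of a fuller check, the function below scans every row for leftover parentheses in the cleaned columns. `find_paren_rows` and the two sample rows are made up for illustration; on the real data it could be called as `find_paren_rows(moma, [2, 3, 4, 5])` while those columns are still strings.

```python
def find_paren_rows(dataset, indexes):
    """Return (row_number, column_index) pairs whose value still contains a parenthesis."""
    leftovers = []
    for row_number, row in enumerate(dataset):
        for index in indexes:
            if '(' in row[index] or ')' in row[index]:
                leftovers.append((row_number, index))
    return leftovers

# Two hand-written rows: the first is clean, the second still has parentheses.
sample = [
    ['Title A', 'Artist A', 'American', '1947', '2013', 'Female', '1986', 'Dept'],
    ['Title B', 'Artist B', '(Spanish)', '1916', '(2007)', 'Male', '1978', 'Dept'],
]

print(find_paren_rows(sample, [2, 3, 4, 5]))
```

An empty list means no parentheses remain in the checked columns.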



Clean up data in the Gender [5] and Nationality [2] columns

Normalize capitalization and add values to specify unknown as needed

In [27]:
for row in moma:
    gender = row[5]
    gender = gender.title()
    if not gender:
        gender = 'Gender Unknown/Other'
    row[5] = gender
    
    nat = row[2]
    nat = nat.title()
    if not nat:
        nat = 'Nationality Unknown'
    row[2] = nat
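
To show what this normalization does to individual values, here is a small sketch on hand-written strings. `normalize_label` is a hypothetical helper, not part of the cell above, and the sample values are invented rather than taken from the data file.

```python
def normalize_label(value, placeholder):
    """Title-case a label, or return the placeholder when the label is empty."""
    value = value.title()
    return value if value else placeholder

for raw in ['male', 'Female', '']:
    print(repr(raw), '->', normalize_label(raw, 'Gender Unknown/Other'))

for raw in ['south african', '']:
    print(repr(raw), '->', normalize_label(raw, 'Nationality Unknown'))
```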

Convert the BeginDate [3] and EndDate [4] values from strings to integers, to make them easier to work with. Empty values are left as empty strings

In [28]:
def convert_to_int(dataset, index):
    for row in dataset:
        string = row[index]
        if string != '':
            string = int(string)
            row[index] = string
    return dataset

convert_to_int(moma, 3)
convert_to_int(moma, 4)

print_rows(moma, 3)

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', 'American', 1947, 2013, 'Female', '1986', 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', 'Spanish', 1916, 2007, 'Male', '1978', 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', 1870, 1943, 'Male', '1889-1911', 'Prints & Illustrated Books'] 



The Date column [6], which records when each artwork was created, is inconsistent: some rows include extra characters (e.g. 'c. 1920' instead of just '1920') and some rows contain a range of years instead of a single year

Remove extra characters

Going beyond the DQ exercise, I've added functionality that detects and reports any characters outside a list of valid characters. '-' is included as a valid character because it denotes a date range, which will be parsed in the next section

In [29]:
valid_chars = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '-']
bad_chars = []
for row in moma:
    string = row[6]
    for char in string:
        if char not in valid_chars:
            string = string.replace(char, '')
            if char not in bad_chars:
                bad_chars.append(char)
    row[6] = string
                               
print('Found the following non-valid chars: ', bad_chars, '\n')

print_rows(moma, 3)

Found the following non-valid chars:  ['(', ')', 'c', '.', ' ', 's', "'"] 

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', 'American', 1947, 2013, 'Female', '1986', 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', 'Spanish', 1916, 2007, 'Male', '1978', 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', 1870, 1943, 'Male', '1889-1911', 'Prints & Illustrated Books'] 
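
The same clean-up can be written with a regular expression. This is an alternative sketch, not the approach used in the cell above: it removes the unwanted characters but does not record which ones were found. `clean_date` is a hypothetical name and the sample strings are hand-written.

```python
import re

def clean_date(text):
    """Drop every character that is not a digit or a hyphen."""
    return re.sub(r'[^0-9-]', '', text)

for raw in ['c. 1920', "(1960's)", '1889-1911']:
    print(repr(raw), '->', repr(clean_date(raw)))
```

Note that with either approach a value such as "1960's" is reduced to the single year '1960'.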



Where the date is a range of years, convert it to the average of the two years, rounded to a whole year

In the third row printed below, the range '1889-1911' has been converted to the single year 1900

In [30]:
def process_date(string):
    if '-' in string:
        d1, d2 = string.split('-')
        avg = round((int(d1) + int(d2)) / 2) 
        return avg
    else:
        return int(string)
    
for row in moma:
    string = row[6]
    string = process_date(string)
    row[6] = string
    
print_rows(moma, 3)

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', 'American', 1947, 2013, 'Female', 1986, 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', 'Spanish', 1916, 2007, 'Male', 1978, 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', 1870, 1943, 'Male', 1900, 'Prints & Illustrated Books'] 
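
A worked example of the averaging. One detail to be aware of: Python 3's `round()` rounds exact halves to the nearest even integer, so a range of two consecutive years does not always round up. The function is repeated from the cell above so that this sketch runs on its own; the input strings are hand-written.

```python
def process_date(string):
    """Same logic as the cell above, repeated so this example is self-contained."""
    if '-' in string:
        d1, d2 = string.split('-')
        return round((int(d1) + int(d2)) / 2)
    return int(string)

print(process_date('1889-1911'))  # (1889 + 1911) / 2 = 1900.0
print(process_date('1920-1921'))  # midpoint 1920.5 rounds to the even year 1920
print(process_date('1921-1922'))  # midpoint 1921.5 rounds to the even year 1922
```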



## Analyze data

Working with the dataset, we'll do the following:
- Calculate the artist's age when they created their artwork
- Analyze and interpret the distribution of artist ages
- Create functions that summarize our data
- Print summaries in an easy-to-read way

In [31]:
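# Calculate the artist's age when each artwork was created.
# Rows with no known birth year keep the placeholder age of 0.
ages = []

for row in moma:
    date = row[6]
    birth = row[3]
    age = 0
    if isinstance(birth, int):
        age = date - birth
    ages.append(age)
    
# Treat ages of 20 or under (including the placeholder 0) as 'Unknown'
final_ages = []

for age in ages:
    final_age = 'Unknown'
    if age > 20:
        final_age = age
    final_ages.append(final_age)
    
# Convert each age to a decade label, e.g. 34 becomes '30s'
decades = []

for age in final_ages:
    decade = 'Unknown'
    if age != 'Unknown':
        decade = str(age)
        decade = decade[:-1]
        decade += '0s'
    decades.append(decade)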
    
# Create a frequency table for each decade
decade_frequency = {}

for dec in decades:
    if dec in decade_frequency:
        decade_frequency[dec] += 1
    else:
        decade_frequency[dec] = 1
        
print(decade_frequency)

{'30s': 4722, '60s': 1357, '70s': 559, '40s': 4081, '50s': 2434, '20s': 1856, 'Unknown': 1093, '90s': 253, '80s': 364, '100s': 3, '110s': 3}
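
The decade labels can also be computed with integer division instead of string slicing. This is a sketch of an equivalent approach, not the code used above; `age_to_decade` and the sample ages are invented for illustration, and `collections.Counter` stands in for the hand-written frequency table.

```python
from collections import Counter

def age_to_decade(age):
    """Label an age with its decade; ages of 20 or under are treated as unknown."""
    if age > 20:
        return str(age // 10 * 10) + 's'
    return 'Unknown'

# Hand-written ages; 0 plays the role of a missing birth year.
sample_ages = [39, 34, 30, 0, 18, 105, 62]

decade_counts = Counter(age_to_decade(age) for age in sample_ages)
print(decade_counts)
```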


Create a frequency table for the number of works by each artist in the data set

In [32]:
artist_freq = {}

for row in moma:
    artist = row[1]
    if artist not in artist_freq:
        artist_freq[artist] = 1
    else:
        artist_freq[artist] += 1
        
print(artist_freq)

{'Sarah Charlesworth': 1, 'Pablo Palazuelo': 4, 'Maurice Denis': 71, 'Aristide Maillol': 77, 'Eugène Atget': 705, 'Antonio Frasconi': 41, 'Garry Winogrand': 47, 'Diane Victor': 4, 'David Brown Milne': 2, 'Jean Dubuffet': 206, 'Jim Dine': 57, 'František Kupka': 37, 'Franklin Chenault Watkins': 4, 'Christopher Wool': 19, 'Abraham Walkowitz': 19, 'Pierre Alechinsky': 67, 'Frank Stella': 17, 'Frank Lloyd Wright': 112, 'Vicente Rojo': 5, 'Ludwig Mies van der Rohe': 318, 'Varvara Stepanova': 6, 'Richard Serra': 4, 'Robert Filliou': 15, 'Roger Chancel': 3, 'Pierre Bonnard': 129, 'Jacqueline Poncelet': 1, 'Émile Bernard': 83, 'Georg Baselitz': 14, 'Frans Masereel': 34, 'Unknown': 448, 'Sol LeWitt': 89, 'James Tenney': 1, 'Claes Oldenburg': 12, 'Dieter Roth': 18, 'Moisei Fradkin': 1, 'Richard Lindner': 1, 'Wojciech Prazmowski': 2, 'Thomas Bewick': 49, 'Spencer Sweeney': 2, 'Batiste Madalena': 5, 'On Kawara': 9, 'Andy Warhol': 41, 'Lee Friedlander': 180, 'Joan Miró': 78, 'Marc Chagall': 173, 'Ro

Sort the frequency table to see which artists have the most works in the data set

In [33]:
table = artist_freq
table_display = []
for key in table:
    key_val_as_tuple = (table[key], key)
    table_display.append(key_val_as_tuple)

table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])

Eugène Atget : 705
Louise Bourgeois : 495
Unknown : 448
Ludwig Mies van der Rohe : 318
Jean Dubuffet : 206
Lee Friedlander : 180
Marc Chagall : 173
Pierre Bonnard : 129
Henri Matisse : 129
Lilly Reich : 118
Frank Lloyd Wright : 112
August Sander : 105
Sol LeWitt : 89
André Derain : 86
Pablo Picasso : 84
Émile Bernard : 83
Dorothea Lange : 83
Joan Miró : 78
Aristide Maillol : 77
Jasper Johns : 76
Raoul Dufy : 74
George Maciunas : 72
Maurice Denis : 71
Georges Rouault : 68
Pierre Alechinsky : 67
Jan Dibbets : 66
Walker Evans : 59
Jim Dine : 57
Album-miscellaneous : 57
Edward Steichen : 51
Thomas Bewick : 49
Christian Boltanski : 49
André Masson : 49
José Antonio Suárez Londoño : 48
Garry Winogrand : 47
Robert Rauschenberg : 45
Ben Kinmont : 45
Ansel Adams : 45
Harry Callahan : 43
Aleksandr Rodchenko : 43
Various Artists : 42
Jacques Villon : 42
Henri Cartier-Bresson : 42
Antonio Frasconi : 41
Andy Warhol : 41
Jules Pascin : 40
John Thomson : 40
Ben Vautier : 40
Alighiero Boetti : 40
Mich

Robert Morris : 6
Robert Indiana : 6
Robert Gober : 6
Richard Long : 6
Peter Henry Emerson : 6
Paulo Monteiro : 6
Paul McCarthy : 6
Paul Martin : 6
Paul Chan : 6
Patrick Caulfield : 6
Otl Aicher (also known as Otto Aicher) : 6
Oscar Domínguez : 6
Minor White : 6
Miguel Rio Branco : 6
Maurizio Cattelan : 6
Marlene Dumas : 6
Louis Comfort Tiffany : 6
Leonid Tishkov : 6
Leonel Gongora : 6
Laurie Anderson : 6
Laura Owens : 6
Larry Fink : 6
Julie Mehretu : 6
Juan Gris : 6
Joseph Grigely : 6
John Szarkowski : 6
John Piper : 6
John Coplans : 6
Joel Shapiro : 6
Jo Baer : 6
Jim Shaw : 6
Jerry Uelsmann : 6
Jean Fautrier : 6
Jan Tschichold : 6
James Turrell : 6
James Siena : 6
Ilya Kabakov : 6
Harun Farocki : 6
Harrell Fletcher : 6
Harold Altman : 6
Hal Fischer : 6
Gustave Le Gray : 6
Gottfried Honegger : 6
George Segal : 6
George Lois : 6
Gego (Gertrud Goldschmidt) : 6
Gabor Peterdi : 6
Friz Freleng : 6
Fritz Glarner : 6
Frederick Sommer : 6
František Kalivoda : 6
Felix Gonzalez-Torres : 6
Elie 

Nikolai Kul'bin : 3
Nicolas Garcia Uriburu : 3
Nicholas Felton : 3
Neelon Crawford : 3
Naomi Savage : 3
Misch Kohn : 3
Ming Smith : 3
Michael Wesely : 3
Michael Spano : 3
Michael Graves : 3
Michael Engelmann : 3
Micha Ullman : 3
Menashe Kadishman : 3
Maxi Cohen : 3
Max Pechstein : 3
Matthew Monahan : 3
Massimo Vignelli : 3
Mary Frank : 3
Martine Franck : 3
Marli Ehrman : 3
Marko Spalatin : 3
Mark Lombardi : 3
Mark Bradford : 3
Mario Merz : 3
Mario Avati : 3
Marco Maggi : 3
Marco Breuer : 3
Marcel Mathys : 3
Marcel Jean : 3
Marcel Breuer : 3
Manfred Pernice : 3
Mabel Dwight : 3
Léon Gischia : 3
Luis Buñuel : 3
Ludwig Hohlwein : 3
Lucie Rie : 3
Lou Stoumen : 3
Leslie Hewitt : 3
Leonard Freed : 3
Leon Krier : 3
Lebbeus Woods : 3
Laurence Daws : 3
Laura Gilpin : 3
Larry Clark : 3
Käthe Kollwitz : 3
Konstantin Grcic : 3
Kinji Akagawa : 3
Ken Domon : 3
Keith Milow : 3
Katharina Fritsch : 3
Karl Schrag : 3
Karl Gerstner : 3
Karel Appel : 3
June Redfern : 3
Julião Sarmento : 3
Juliusz Studnick

M/M Paris : 2
Léo Leuppi : 2
László Peri : 2
Lynne Yamamoto : 2
Lynn Chadwick : 2
Lyle Starr : 2
Lygia Clark : 2
Lydia Naumova : 2
Lutz Mommartz : 2
Luigi Scrivo : 2
Ludwig Meidner : 2
Lucio Pozzi : 2
Lucio Fontana : 2
Lucien Hervé : 2
Luc-Albert Moreau : 2
Lourdes Portillo : 2
Louis Silverstein : 2
Louis Favre : 2
Louis Faurer : 2
Lotte Jacobi : 2
Lothar Schreyer : 2
Lois Conner : 2
Llyn Foulkes : 2
Lisa Yuskavage : 2
Lisa Baumgardner : 2
Linda Connor : 2
Lillian Schwartz : 2
Liliana Porter : 2
Lettie A. Allen : 2
Leon Golub : 2
Leo Lionni : 2
Lennart Olson : 2
Lee Lozano : 2
Lawrence Kupferman : 2
Laurie Simmons : 2
Laurel Nakadate : 2
Laura Grisi : 2
Larry Miller : 2
Lajos Kassák : 2
Kurt Seligmann : 2
Kumi Sugaï : 2
Kosti Ruohomaa : 2
Koloman Moser : 2
Koichi Sato : 2
Kn Thurlbeck : 2
Klaus Rinke : 2
Kikuji Kawada : 2
Ketaki Sheth : 2
Kerry Tribe : 2
Ker-Xavier Roussel : 2
Kenneth Brozen : 2
Kengo Kuma : 2
Ken Jacobs : 2
Kawanishi Hide : 2
Katsuhiro Yamaguchi : 2
Karl Free : 2
Jörg

W. & D. Downey : 1
Vsevolod Pudovkin : 1
Vladimir Velickovic : 1
Vladimir Kozlinskii : 1
Vladimir Izenberg : 1
Vladimir Borissovitch Yankilevsky : 1
Vivian Cherry : 1
Vita Castro : 1
Vincent de Rijk : 1
Vincent Carelli : 1
Vincent Borrelli : 1
Vincent Alan W : 1
Vilem Reichmann : 1
Victoria Sambunaris : 1
Victor Schrager : 1
Victor Pasmore : 1
Victor Mira : 1
Victor Huey : 1
Victor Grippo : 1
Victor Fleming : 1
Vibeke Tandberg : 1
Viacheslav Pakulin : 1
Viaceslav Kalinin : 1
Vernon Heath : 1
Verner Panton : 1
Vern Blosum : 1
Velox Ward : 1
Vasyl' Sedliar : 1
Vasudeo S. Gaitonde : 1
Vaslaw Nijinsky : 1
Vasilii Kamenskii : 1
Vanessa Bell : 1
Valve : 1
Valto Kokko : 1
Valentine Hugo : 1
Valentina Kulagina-Klutsis : 1
Valentina Kulagina : 1
Valentina Khodasevich : 1
Val Telberg : 1
Vaclav Vytlacil : 1
Uwe Lausen : 1
Uta Barth : 1
Ursula von Rydingsvard : 1
Umberto Boccioni : 1
Ulli Maier : 1
Ula Hedwig : 1
Ugo Rondinone : 1
Tulio Raggi : 1
Tulio Crali : 1
Tsugouharu Foujita : 1
Tristan Tza

Paul Brühwiler : 1
Paul A. McDonough : 1
Patrick Procktor : 1
Patrick Hughes : 1
Patrick Heron : 1
Patricia Johanson : 1
Patricia Cardoso : 1
Patrice Leconte : 1
Pat Passlof : 1
Pat O'Connor : 1
Pat Lasch : 1
Pat Candido/The New York Daily News : 1
Pasqualino Cangiullo : 1
Pasquale Santoro : 1
Parr : 1
Paris Observatory : 1
Paolo Rizzatto : 1
Paolo Lombardi : 1
Paolo Labañino : 1
Paolo Gasparini : 1
Palmer Hayden : 1
Palle Nielson : 1
P. C. Helleu : 1
P. A. Miller : 1
Otto Wichterle : 1
Otto Treumann : 1
Otto Steinert : 1
Otto Prutscher : 1
Otto Piene : 1
Otto Nebel : 1
Otto Freundlich : 1
Otis Shepard : 1
Othon Coubine (or Otakar Kubin) : 1
Oswald Michel : 1
Oswald Haerdtl : 1
Osvaldo Romberg : 1
Osvaldo Peruzzi : 1
Osvaldo Borsani : 1
Ossip Zadkine : 1
Oskar Schlemmer : 1
Oskar Kogoj : 1
Osiah Masekoameng : 1
Oscar Tenreiro Degwitz : 1
Orlando Mesquita : 1
Oren Moverman : 1
Omar Carreño : 1
Olympe Aguado de las Marismas : 1
Olivier : 1
Oliver Jackson : 1
Oliver Hermanus : 1
Olin Dows

Julia Wachtel : 1
Julia Margaret Cameron : 1
Jules Allen : 1
Judith Shea : 1
Judith Godwin : 1
Juan Soriano : 1
Juan Gómez-Quiroz : 1
Juan Genovés : 1
Joël Stein : 1
João Luis Sol de Carvalho : 1
Jozsef Rippl-Rónai : 1
Joyce Pensato : 1
José Yalenti : 1
José Sabogal : 1
José R. Alicea : 1
José María Sicilia : 1
José Leonilson : 1
José Gamarra : 1
José De Creeft : 1
Joshua Light Show : 1
Joseph Stalnaker : 1
Joseph Nechvatal : 1
Joseph Maria Olbrich : 1
Joseph Janvier Woodward : 1
Joseph Hirsch : 1
Joseph Glasco : 1
Joseph Cornell : 1
Joseph Breitenbach : 1
Josep Subirats Samora : 1
Josef Sudek : 1
Josef Scharl : 1
Josef Peeters : 1
Josef Gielniak : 1
Josef Bato : 1
Jose Espert : 1
Jose Antonio Fernández-Muro : 1
Jos van der Meulen : 1
Jorge Macchi : 1
Jordi Secall Roure : 1
Joost Schmidt : 1
Joon-ho Bong : 1
Jonathas de Andrade : 1
Jonathan Monk : 1
Jonathan Lasker : 1
Jonathan Horowitz : 1
Jonathan Borofsky : 1
Jonas J. Fendell : 1
Jon Widman : 1
Jon T. O'Neal : 1
John William Carnell

Frederic Karoly : 1
Fred Tomaselli : 1
Fred Taylor : 1
Fred Stein : 1
Fred Niblo : 1
Françoise Gilot : 1
François Kollar : 1
Franz Kline : 1
Franz Hitzler : 1
Frank Magnotta : 1
Frank Jay Haynes : 1
Frank Hinder : 1
Frank Graham Holmes : 1
Frank Dobson : 1
Frank Badur : 1
Frank Auerbach : 1
Francisco Toledo : 1
Francisco Matto : 1
Francisco Blanco : 1
Francis Thompson : 1
Francis Bacon : 1
Francesco Cangiullo : 1
Francesco Binfaré : 1
Francesc Torres : 1
Frances Butler : 1
Fougasse (Cyril Kenneth Bird) : 1
Fotograms : 1
Fortunato Depero : 1
Forrest Bess : 1
Forbes Johnstone Whiteside : 1
Flemming Bo Hansen : 1
Fernando Leal : 1
Fernando Castillo : 1
Fernando Bryce : 1
Felix Harlan : 1
Felix Beltran : 1
Felipe Ehrenberg : 1
Fayga Ostrower : 1
Faurest Davis : 1
Farnese de Andrade-Neto : 1
Farkas Molnár : 1
Faouzi Bensaïdi : 1
Fannie Hillsmith : 1
Fang Lijun : 1
Fabian Marcaccio : 1
F.W. Murnau : 1
F. Armbruster : 1
Ezra Stoller : 1
Ezio Pirali : 1
Ezio Martinelli : 1
Ezio Gribaudo : 1
Ew

April Greiman : 1
Apple Industrial Design Group : 1
Apichatpong Weerasethakul : 1
Anupama Kundoo : 1
Antoni Clavé : 1
Anton Stankowski : 1
Anton Bruehl : 1
Antoine Pevsner : 1
Anthony Tennant : 1
Anthony Minghella : 1
Anthony McCall : 1
Anthony Harrison : 1
Anthony Gross : 1
Anthony Goicolea : 1
Anthony Barboza : 1
Annette Kelm : 1
Annemarie Heinrich : 1
Anne W. Brigman : 1
Anne Turyn : 1
Anne Truitt : 1
Anne Noggle : 1
Anne Marie Fishbein : 1
Anne Goldthwaite : 1
Anna Bella Geiger : 1
Ann Magnuson : 1
Ann Hamilton : 1
Angus Fairhurst : 1
Angel Bracho : 1
André Téchiné : 1
André Mare : 1
André Lhote : 1
André Fougeron : 1
André Favory : 1
André Calmettes : 1
André Cadere : 1
Andrzej J. Wroblewski : 1
Andrew Wyeth : 1
Andrew Noren : 1
Andrew Lord : 1
Andrew Kay Womrath : 1
Andrew Joseph Russell : 1
Andrei Zvyagintsev : 1
Andrei Gippius : 1
Andreas Feininger : 1
Andrea Modica : 1
Anatoly Timofeevich Zverev : 1
Anatol Stern : 1
Ana Maria Moncalvo : 1
Amédée Ozenfant : 1
Amelie von Wulffen
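
Printing every artist produces a very long listing. As a sketch of how to show only the top few, `collections.Counter` can build the table and `most_common(n)` can limit the output. The rows below are hand-written stand-ins for the real data. One difference from the sort above: `most_common` lists artists with equal counts in the order they were first encountered, whereas sorting `(count, name)` tuples in reverse lists ties in reverse alphabetical order.

```python
from collections import Counter

# Hand-written sample rows; only the Artist column (index 1) matters here.
sample = [
    ['Work 1', 'Artist A'], ['Work 2', 'Artist B'], ['Work 3', 'Artist A'],
    ['Work 4', 'Artist C'], ['Work 5', 'Artist A'], ['Work 6', 'Artist B'],
]

artist_counts = Counter(row[1] for row in sample)

# most_common(n) returns the n highest counts as (artist, count) pairs.
for artist, count in artist_counts.most_common(2):
    print(artist, ':', count)
```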

Print summary statistics about various artists

In [34]:
def artist_summary(artist):
    # .get() returns 0 for an artist who is not in the table, instead of raising a KeyError
    num_artworks = artist_freq.get(artist, 0)
    if num_artworks > 1:
        return "There are {} artworks by {} in the data set".format(num_artworks, artist)
    elif num_artworks == 1:
        return "There is {} artwork by {} in the data set".format(num_artworks, artist)
    else:
        return "There is no artwork by {} in the data set".format(artist)
    
print(artist_summary("Henri Matisse"))
print(artist_summary("Sarah Charlesworth"))

There are 129 artworks by Henri Matisse in the data set
There is 1 artwork by Sarah Charlesworth in the data set


Display the number of artworks by artists of each gender

In [35]:
gender_freq = {}

for row in moma:
    gender = row[5]
    if gender not in gender_freq:
        gender_freq[gender] = 1
    else:
        gender_freq[gender] += 1
        
for key, value in gender_freq.items():
    print("There are {:,} artworks by {} artists".format(value, key))

There are 2,443 artworks by Female artists
There are 13,491 artworks by Male artists
There are 791 artworks by Gender Unknown/Other artists
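
The raw counts are easier to compare as shares of the total. A minimal sketch, with the counts copied from the output above into a literal dictionary so that it runs on its own:

```python
# Counts copied from the output above.
gender_freq = {'Female': 2443, 'Male': 13491, 'Gender Unknown/Other': 791}

total = sum(gender_freq.values())

for gender, count in gender_freq.items():
    share = count / total
    print("{:.1%} of artworks are by {} artists".format(share, gender))
```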
