# MoMA Data Cleaning and Analysis

This notebook cleans and analyzes data about the artworks in the Museum of Modern Art (MoMA)

## Data headers
Column      | Index | Description
----------- | --- | -----------
Title | 0 | the title of the artwork
Artist | 1 | the name of the artist who created the artwork
Nationality | 2 |  the nationality of the artist
BeginDate | 3 | the year in which the artist was born
EndDate | 4 | the year in which the artist died
Gender | 5 | the gender of the artist
Date | 6 | the date that the artwork was created
Department | 7 | the department inside MoMA to which the artwork belongs

## Read data
Open the data file and read in the data

Print the header row, then delete it from data

In [23]:
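from csv import reader

# Use a context manager so the file is closed once the rows have been read
with open('artworks.csv', encoding='utf-8') as opened_file:
    moma = list(reader(opened_file))

print(moma[0])   # header row
moma = moma[1:]  # drop the header row from the data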

['Title', 'Artist', 'Nationality', 'BeginDate', 'EndDate', 'Gender', 'Date', 'Department']
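
The rest of the notebook refers to columns by hard-coded index. As an optional alternative, the header row can be turned into a name-to-index lookup. This is only a minimal sketch: the name `col_index` is hypothetical and is not used elsewhere in the notebook.

```python
# Header row as printed above
header = ['Title', 'Artist', 'Nationality', 'BeginDate',
          'EndDate', 'Gender', 'Date', 'Department']

# Map each column name to its position (hypothetical helper, not in the original notebook)
col_index = {name: i for i, name in enumerate(header)}

print(col_index['Gender'])  # 5
print(col_index['Date'])    # 6
```

With such a lookup, `row[col_index['Gender']]` would read the same value as `row[5]`.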


Print the first few rows of the data

In [24]:
def print_rows(dataset, num_rows):
    """Print the first num_rows rows, each followed by a blank line."""
    for row in dataset[:num_rows]:
        print(row, '\n')

print_rows(moma, 3)

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', '(American)', '(1947)', '(2013)', '(Female)', '1986', 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', '(Spanish)', '(1916)', '(2007)', '(Male)', '1978', 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', '(French)', '(1870)', '(1943)', '(Male)', '1889-1911', 'Prints & Illustrated Books'] 



## Clean data
Remove the parentheses from several columns
 
Rather than rewriting the code for each column, as the DQ exercise instructs, I've written a function that can be reused for every column we want to clean

In [25]:
def remove_parens(dataset, index):
    """Strip '(' and ')' from the given column of every row, in place."""
    for row in dataset:
        row[index] = row[index].replace('(', '').replace(')', '')

# Clean Nationality [2], BeginDate [3], EndDate [4] and Gender [5].
# The function modifies the rows in place and returns nothing,
# so the notebook has no return value to display.
for index in [2, 3, 4, 5]:
    remove_parens(moma, index)




Print a few rows to verify that the parentheses have been removed

In [26]:
print_rows(moma, 3)

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', 'American', '1947', '2013', 'Female', '1986', 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', 'Spanish', '1916', '2007', 'Male', '1978', 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', '1870', '1943', 'Male', '1889-1911', 'Prints & Illustrated Books'] 
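
Printing three rows only spot-checks the result. A check over every row could look like the sketch below, which runs on a small stand-in list rather than the full `moma` data. The helper name `rows_with_parens` is hypothetical. Only the cleaned columns are inspected, because titles (column 0) can legitimately contain parentheses.

```python
def rows_with_parens(dataset, indexes):
    """Return the rows that still contain '(' or ')' in any of the given columns."""
    leftovers = []
    for row in dataset:
        if any('(' in str(row[i]) or ')' in str(row[i]) for i in indexes):
            leftovers.append(row)
    return leftovers

# Stand-in data: the second row was deliberately left uncleaned
sample = [
    ['Title A', 'Artist A', 'American', '1947', '2013', 'Female', '1986', 'Dept'],
    ['Title B', 'Artist B', '(Spanish)', '1916', '2007', 'Male', '1978', 'Dept'],
]

bad_rows = rows_with_parens(sample, [2, 3, 4, 5])
print(len(bad_rows))  # 1
```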



Clean up the Gender [5] and Nationality [2] columns

Normalize capitalization and replace empty values with an explicit 'unknown' label

In [27]:
for row in moma:
    gender = row[5]
    gender = gender.title()
    if not gender:
        gender = 'Gender Unknown/Other'
    row[5] = gender
    
    nat = row[2]
    nat = nat.title()
    if not nat:
        nat = 'Nationality Unknown'
    row[2] = nat
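
One way to confirm that the normalization worked is to count the distinct values left in a column. The sketch below uses made-up stand-in values, not the real data, and a hypothetical helper named `normalize` that applies the same rule as the loop above.

```python
from collections import Counter

def normalize(value, default):
    """Title-case a value, or fall back to a default label when it is empty."""
    value = value.title()
    return value if value else default

# Stand-in values showing mixed capitalization and a missing entry
raw_genders = ['Male', 'male', 'Female', '', 'female']
cleaned = [normalize(g, 'Gender Unknown/Other') for g in raw_genders]

# Each distinct label with the number of times it occurs
gender_check = Counter(cleaned)
print(gender_check)
```

If a column still held variants such as 'male' and 'Male', they would show up here as separate keys.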

Convert the birth and death years (BeginDate [3] and EndDate [4]) from strings to integers, to make them easier to work with

In [28]:
def convert_to_int(dataset, index):
    for row in dataset:
        string = row[index]
        if string != '':
            string = int(string)
            row[index] = string
    return dataset

convert_to_int(moma, 3)
convert_to_int(moma, 4)

print_rows(moma, 3)

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', 'American', 1947, 2013, 'Female', '1986', 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', 'Spanish', 1916, 2007, 'Male', '1978', 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', 1870, 1943, 'Male', '1889-1911', 'Prints & Illustrated Books'] 



The Date column, which records when each artwork was created, is inconsistent: some rows contain extra characters (e.g. 'c. 1920' instead of just '1920'), and some contain a range of years instead of a single year

Remove extra characters

Going beyond the DQ exercise, this cell also records every character it finds that is not in a list of valid characters. '-' counts as valid because it denotes a date range, which is parsed in the next section

In [29]:
valid_chars = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '-']
bad_chars = []
for row in moma:
    string = row[6]
    for char in string:
        if char not in valid_chars:
            string = string.replace(char, '')
            if char not in bad_chars:
                bad_chars.append(char)
    row[6] = string
                               
print('Found the following non-valid chars: ', bad_chars, '\n')

print_rows(moma, 3)

Found the following non-valid chars:  ['(', ')', 'c', '.', ' ', 's', "'"] 

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', 'American', 1947, 2013, 'Female', '1986', 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', 'Spanish', 1916, 2007, 'Male', '1978', 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', 1870, 1943, 'Male', '1889-1911', 'Prints & Illustrated Books'] 
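
The same filtering can also be expressed with a regular expression. This is an alternative sketch, not the method used in the cell above: the helper `strip_invalid` is hypothetical and the input strings are illustrative.

```python
import re

def strip_invalid(text):
    """Keep only digits and '-'; return the cleaned text and the set of removed characters."""
    removed = set(re.findall(r'[^0-9-]', text))
    cleaned = re.sub(r'[^0-9-]', '', text)
    return cleaned, removed

# Illustrative inputs, chosen to match the kinds of characters found above
print(strip_invalid('c. 1920'))
print(strip_invalid("(1960's)"))
print(strip_invalid('1889-1911'))
```

The character class `[^0-9-]` matches anything that is not a digit or a hyphen, which corresponds to the `valid_chars` list.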



Where the date is a range of years, replace it with the average of the two years

In the third row printed below, the range '1889-1911' has been replaced by a single year, 1900, the midpoint of the range

In [30]:
def process_date(string):
    if '-' in string:
        d1, d2 = string.split('-')
        avg = round((int(d1) + int(d2)) / 2) 
        return avg
    else:
        return int(string)
    
for row in moma:
    string = row[6]
    string = process_date(string)
    row[6] = string
    
print_rows(moma, 3)

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', 'American', 1947, 2013, 'Female', 1986, 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', 'Spanish', 1916, 2007, 'Male', 1978, 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', 1870, 1943, 'Male', 1900, 'Prints & Illustrated Books'] 
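
Python 3's built-in `round` rounds exact halves to the nearest even integer, so a range whose midpoint ends in .5 may round down rather than up. The short illustration below restates `process_date` so that it runs on its own; the two one-year ranges are made-up inputs.

```python
def process_date(string):
    # Same logic as the cell above, repeated here so the example is self-contained
    if '-' in string:
        d1, d2 = string.split('-')
        return round((int(d1) + int(d2)) / 2)
    return int(string)

print(process_date('1889-1911'))  # 1900
print(process_date('1912-1913'))  # 1912 (1912.5 rounds to the even neighbour)
print(process_date('1913-1914'))  # 1914 (1913.5 rounds to the even neighbour)
```

For this analysis a one-year difference is unlikely to matter, but it explains why some midpoints round down.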



## Analyze data

Working with the dataset, we'll do the following:
- Calculate the artist's age when they created their artwork
- Analyze and interpret the distribution of artist ages
- Create functions that summarize our data
- Print summaries in an easy-to-read way

In [38]:
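# Determine the age of each artist when the artwork was created
ages = []

for row in moma:
    date = row[6]
    birth = row[3]
    age = 0  # placeholder when the birth year is missing
    if isinstance(birth, int):
        age = int(date) - birth
    ages.append(age)

# Treat ages of 20 or less (including the placeholder 0) as unknown
final_ages = []

for age in ages:
    final_age = 'Unknown'
    if age > 20:
        final_age = age
    final_ages.append(final_age)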

# Convert the age into a decade
decades = []

for age in final_ages:
    decade = 'Unknown'
    if age != 'Unknown':
        decade = str(age)
        decade = decade[:-1]
        decade += '0s'
    decades.append(decade)
    
# Create a frequency table for each decade
decade_frequency = {}

for dec in decades:
    if dec in decade_frequency:
        decade_frequency[dec] += 1
    else:
        decade_frequency[dec] = 1
        
print(decade_frequency)

{'30s': 4722, '60s': 1357, '70s': 559, '40s': 4081, '50s': 2434, '20s': 1856, 'Unknown': 1093, '90s': 253, '80s': 364, '100s': 3, '110s': 3}
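
The decade labels can also be built with integer division instead of string slicing. The sketch below uses made-up ages and a hypothetical helper `age_to_decade`. It follows the same rule as the cell above: ages of 20 or less count as 'Unknown', so the '20s' bucket only covers ages 21 to 29.

```python
from collections import Counter

def age_to_decade(age):
    """Return a label such as '30s'; ages of 20 or less are treated as 'Unknown'."""
    if age <= 20:
        return 'Unknown'
    return '{}s'.format(age // 10 * 10)

# Made-up ages for illustration
sample_ages = [0, 20, 21, 39, 45, 47, 101]
decade_counts = Counter(age_to_decade(a) for a in sample_ages)
print(decade_counts)
```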


Create a frequency table for the number of works each artist created

In [37]:
artist_freq = {}

for row in moma:
    artist = row[1]
    if artist not in artist_freq:
        artist_freq[artist] = 1
    else:
        artist_freq[artist] += 1
        
#print(artist_freq)

Sort the frequency table to see which artists have the most artworks in the data set

In [36]:
table = artist_freq
table_display = []
for key in table:
    key_val_as_tuple = (table[key], key)
    table_display.append(key_val_as_tuple)

table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted[:20]:
    print(entry[1], ':', entry[0])

Eugène Atget : 705
Louise Bourgeois : 495
Unknown : 448
Ludwig Mies van der Rohe : 318
Jean Dubuffet : 206
Lee Friedlander : 180
Marc Chagall : 173
Pierre Bonnard : 129
Henri Matisse : 129
Lilly Reich : 118
Frank Lloyd Wright : 112
August Sander : 105
Sol LeWitt : 89
André Derain : 86
Pablo Picasso : 84
Émile Bernard : 83
Dorothea Lange : 83
Joan Miró : 78
Aristide Maillol : 77
Jasper Johns : 76
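
The same ranking can be produced without building an intermediate list of tuples, either with `sorted` and a `key` function or with `collections.Counter.most_common`. This is a sketch on made-up counts, not the real `artist_freq` table.

```python
from collections import Counter

# Made-up counts for illustration
sample_freq = {'Artist A': 3, 'Artist B': 7, 'Artist C': 3}

# Sort by count, highest first; entries with equal counts keep dictionary order
by_count = sorted(sample_freq.items(), key=lambda item: item[1], reverse=True)
for name, count in by_count:
    print(name, ':', count)

# Counter.most_common returns the n entries with the highest counts
top_two = Counter(sample_freq).most_common(2)
print(top_two)
```

Tie handling differs slightly from the cell above. Sorting `(count, name)` tuples in reverse puts tied artists in reverse alphabetical order, which is why Pierre Bonnard appears before Henri Matisse, whereas both approaches here keep tied entries in the order they were first seen.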


Print summary statistics about various artists

In [34]:
def artist_summary(artist):
    # .get returns 0 for artists who are not in the table, instead of raising a KeyError
    num_artworks = artist_freq.get(artist, 0)
    if num_artworks > 1:
        return "There are {} artworks by {} in the data set".format(num_artworks, artist)
    elif num_artworks == 1:
        return "There is {} artwork by {} in the data set".format(num_artworks, artist)
    else:
        return "There is no artwork by {} in the data set".format(artist)

print(artist_summary("Henri Matisse"))
print(artist_summary("Sarah Charlesworth"))

There are 129 artworks by Henri Matisse in the data set
There is 1 artwork by Sarah Charlesworth in the data set


Count the artworks by artists of each gender and display the results

In [35]:
gender_freq = {}

for row in moma:
    gender = row[5]
    if gender not in gender_freq:
        gender_freq[gender] = 1
    else:
        gender_freq[gender] += 1
        
for key, value in gender_freq.items():
    print("There are {:,} artworks by {} artists".format(value, key))

There are 2,443 artworks by Female artists
There are 13,491 artworks by Male artists
There are 791 artworks by Gender Unknown/Other artists
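
Raw counts are easier to compare as shares of the total. The sketch below copies the three counts from the output above into a new dictionary (the names `gender_counts` and `shares` are hypothetical) and prints each as a percentage.

```python
# Counts copied from the output above
gender_counts = {'Female': 2443, 'Male': 13491, 'Gender Unknown/Other': 791}

total = sum(gender_counts.values())
shares = {key: value / total * 100 for key, value in gender_counts.items()}

for key, share in shares.items():
    print("{:.1f}% of artworks are by {} artists".format(share, key))
```

The total of 16,725 matches the sum of the decade frequency table, which is a quick consistency check that no rows were dropped between the two cells.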
