# MoMA Data Analysis

We are going to work with data about the artworks in the Museum of Modern Art (MoMA).

## Data headers
Column | Index | Description
--- | --- | ---
Title | 0 | the title of the artwork
Artist | 1 | the name of the artist who created the artwork
Nationality | 2 |  the nationality of the artist
BeginDate | 3 | the year in which the artist was born
EndDate | 4 | the year in which the artist died
Gender | 5 | the gender of the artist
Date | 6 | the date that the artwork was created
Department | 7 | the department inside MoMA to which the artwork belongs
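
The code in this notebook refers to columns by position, so it can help to give the indexes names. A minimal sketch, assuming the column order in the table above; `COLUMNS` and `COL` are illustrative names that the rest of the notebook does not use:

```python
# Column names in the order listed in the table above.
COLUMNS = ['Title', 'Artist', 'Nationality', 'BeginDate',
           'EndDate', 'Gender', 'Date', 'Department']

# Map each column name to its index, e.g. COL['Gender'] is 5.
COL = {name: index for index, name in enumerate(COLUMNS)}

# A row in the same layout as the data file.
sample_row = ['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', '(American)',
              '(1947)', '(2013)', '(Female)', '1986', 'Prints & Illustrated Books']

print(sample_row[COL['Gender']])  # (Female)
```

With such a mapping, `row[COL['Gender']]` could stand in for `row[5]` wherever readability matters more than brevity.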

Open the data file and read in the data.

Print the header row, then remove it so that only data rows remain.

In [27]:
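from csv import reader

# Read every row into a list; the with block closes the file afterwards.
with open('artworks.csv', encoding='utf-8', newline='') as opened_file:
    moma = list(reader(opened_file))

print(moma[0])   # header row
moma = moma[1:]  # drop the header so only data rows remain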

['Title', 'Artist', 'Nationality', 'BeginDate', 'EndDate', 'Gender', 'Date', 'Department']


Print the first few rows of the data.

In [28]:
for row in moma[:3]:
    print(row, '\n')

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', '(American)', '(1947)', '(2013)', '(Female)', '1986', 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', '(Spanish)', '(1916)', '(2007)', '(Male)', '1978', 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', '(French)', '(1870)', '(1943)', '(Male)', '1889-1911', 'Prints & Illustrated Books'] 



Remove the parentheses from the Nationality, BeginDate, EndDate, and Gender columns.

Rather than rewriting the code for each column as instructed in DQ, I've created a function that can be reused for every column we want to clean.

In [29]:
def remove_parens(dataset, index):
    # Strip '(' and ')' from the given column of every row, in place.
    for row in dataset:
        value = row[index]  # avoid shadowing the built-in str
        value = value.replace('(', '')
        value = value.replace(')', '')
        row[index] = value
    return dataset

# Nationality, BeginDate, EndDate, Gender.
# Calling inside a loop keeps the notebook from echoing the return value.
for index in (2, 3, 4, 5):
    remove_parens(moma, index)




Print the first few rows again to check that the parentheses have been removed.

In [30]:
for row in moma[:3]:
    print(row, '\n')

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', 'American', '1947', '2013', 'Female', '1986', 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', 'Spanish', '1916', '2007', 'Male', '1978', 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', '1870', '1943', 'Male', '1889-1911', 'Prints & Illustrated Books'] 
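
Printing three rows only confirms the fix for those rows. A minimal sketch of a check across every row, using a hypothetical helper `rows_with_parens` (not part of the original notebook) and a small stand-in for `moma`:

```python
def rows_with_parens(dataset, indexes):
    """Return the rows that still contain '(' or ')' in any of the given columns."""
    leftovers = []
    for row in dataset:
        if any('(' in row[i] or ')' in row[i] for i in indexes):
            leftovers.append(row)
    return leftovers

# Stand-in data: the second row deliberately still has parentheses in Gender.
sample = [
    ['Title A', 'Artist A', 'American', '1947', '2013', 'Female', '1986', 'Dept'],
    ['Title B', 'Artist B', 'Spanish', '1916', '2007', '(Male)', '1978', 'Dept'],
]

print(len(rows_with_parens(sample, (2, 3, 4, 5))))  # 1
```

On the real data, `rows_with_parens(moma, (2, 3, 4, 5))` would be expected to return an empty list at this point, while those columns are still strings. The Title column is left out because titles can legitimately contain parentheses.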



Clean up the data in the Gender [5] and Nationality [2] columns.

Normalize the capitalization, and substitute a placeholder label where the value is missing.

In [31]:
for row in moma:
    gender = row[5]
    gender = gender.title()
    if not gender:
        gender = 'Gender Unknown/Other'
    row[5] = gender
    
    nat = row[2]
    nat = nat.title()
    if not nat:
        nat = 'Nationality Unknown'
    row[2] = nat
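
Both columns get the same two steps: title-case the value, then fall back to a placeholder when it is empty. A sketch of how that could be factored into one helper, where `normalize_label` is a hypothetical name and not part of the original notebook:

```python
def normalize_label(value, placeholder):
    """Title-case a label, or return the placeholder if the label is empty."""
    value = value.title()
    return value if value else placeholder

print(normalize_label('male', 'Gender Unknown/Other'))      # Male
print(normalize_label('', 'Gender Unknown/Other'))          # Gender Unknown/Other
print(normalize_label('american', 'Nationality Unknown'))   # American
```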

Convert the BeginDate and EndDate columns from strings to integers, to make them easier to work with.

In [32]:
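def convert_to_int(dataset, index):
    # Convert the given column to int in place; empty strings are left unchanged.
    for row in dataset:
        value = row[index]  # avoid shadowing the built-in str
        if value != '':
            row[index] = int(value)
    return dataset

convert_to_int(moma, 3)  # BeginDate
convert_to_int(moma, 4)  # EndDate

for row in moma[:3]:
    print(row, '\n')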

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', 'American', 1947, 2013, 'Female', '1986', 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', 'Spanish', 1916, 2007, 'Male', '1978', 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', 1870, 1943, 'Male', '1889-1911', 'Prints & Illustrated Books'] 



The Date column, which records when the artwork was created, is inconsistent: some rows include extra characters (e.g. 'c. 1920' instead of just '1920'), and some rows contain a range of years instead of a single year.

Remove the extra characters.

In [33]:
valid_chars = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '-']
bad_chars = []  # every distinct character that gets removed
for row in moma:
    date = row[6]  # avoid shadowing the built-in str
    for char in date:
        if char not in valid_chars:
            date = date.replace(char, '')
            if char not in bad_chars:
                bad_chars.append(char)
    row[6] = date

print(bad_chars, '\n')

for row in moma[:3]:
    print(row, '\n')

['(', ')', 'c', '.', ' ', 's', "'"] 

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', 'American', 1947, 2013, 'Female', '1986', 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', 'Spanish', 1916, 2007, 'Male', '1978', 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', 1870, 1943, 'Male', '1889-1911', 'Prints & Illustrated Books'] 
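
The same clean-up can also be written with a regular expression. This is an alternative sketch, not the approach used in the cell above; `keep_digits_and_hyphens` is a hypothetical name, and the sample strings are made up to resemble the removed characters listed in the output:

```python
import re

def keep_digits_and_hyphens(text):
    """Drop every character that is not an ASCII digit or a hyphen."""
    return re.sub(r'[^0-9-]', '', text)

print(keep_digits_and_hyphens('c. 1920'))    # 1920
print(keep_digits_and_hyphens("(1960's)"))   # 1960
print(keep_digits_and_hyphens('1889-1911'))  # 1889-1911
```

Unlike the loop above, this version does not record which characters were removed, so it gives up the `bad_chars` report in exchange for brevity.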



Where the value is a range of years, replace it with the average of the two years, rounded to a whole number.

In [34]:
def process_date(string):
    if '-' in string:
        # A range such as '1889-1911': average the two years.
        d1, d2 = string.split('-')
        return round((int(d1) + int(d2)) / 2)
    else:
        return int(string)

for row in moma:
    row[6] = process_date(row[6])

for row in moma[:3]:
    print(row, '\n')

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', 'American', 1947, 2013, 'Female', 1986, 'Prints & Illustrated Books'] 

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', 'Spanish', 1916, 2007, 'Male', 1978, 'Prints & Illustrated Books'] 

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', 1870, 1943, 'Male', 1900, 'Prints & Illustrated Books'] 
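
`process_date` assumes that every value is either a single year or exactly two years joined by one hyphen. An empty string or a stray hyphen would raise a `ValueError`. Whether such values occur in `artworks.csv` is not shown here, so the following is only a defensive sketch: `process_date_safe` is a hypothetical name, and returning `None` for values it cannot parse is an assumption.

```python
def process_date_safe(text):
    """Return a year as int, the rounded midpoint of a 'start-end' range, or None."""
    parts = text.split('-')
    if len(parts) == 1 and parts[0].isdecimal():
        return int(parts[0])
    if len(parts) == 2 and all(part.isdecimal() for part in parts):
        return round((int(parts[0]) + int(parts[1])) / 2)
    return None  # empty string, stray hyphen, or anything else unexpected

print(process_date_safe('1986'))       # 1986
print(process_date_safe('1889-1911'))  # 1900
print(process_date_safe(''))           # None
print(process_date_safe('1950-1951'))  # 1950
```

The last call shows that Python's `round` rounds halves to the nearest even number, so 1950.5 becomes 1950 rather than 1951. The original `process_date` behaves the same way.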

