# MoMA collection data cleaning

## Problem solving
I'm working for MoMA, which would like to know which departments need to be enriched, based on the current collection.

**What are the three least valuable classifications?**

In [1]:
import numpy as np
import pandas as pd
import re

In [2]:
df=pd.read_csv('data/museum_modern_art.csv',sep=',')
df.head()



Unnamed: 0.1,Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,0,"Ferdinandsbrücke Project, Vienna, Austria, Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,...,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.6,,,168.9,,
1,1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,...,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.6401,,,29.8451,,
2,2,"Villa near Vienna Project, Outside Vienna, Aus...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,...,http://www.moma.org/media/W1siZiIsIjk4Il0sWyJw...,,,,34.3,,,31.8,,
3,3,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)",(),(1944),(0),(Male),1980,...,http://www.moma.org/media/W1siZiIsIjEyNCJdLFsi...,,,,50.8,,,50.8,,
4,4,"Villa, project, outside Vienna, Austria, Exter...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,...,http://www.moma.org/media/W1siZiIsIjEyNiJdLFsi...,,,,38.4,,,19.1,,


In [3]:
df.dtypes

Unnamed: 0             object
Title                  object
Artist                 object
ConstituentID          object
ArtistBio              object
Nationality            object
BeginDate              object
EndDate                object
Gender                 object
Date                   object
Medium                 object
Dimensions             object
CreditLine             object
AccessionNumber        object
Classification         object
Department             object
DateAcquired           object
Cataloged              object
ObjectID               object
URL                    object
ThumbnailURL           object
Circumference (cm)    float64
Depth (cm)            float64
Diameter (cm)         float64
Height (cm)           float64
Length (cm)           float64
Weight (kg)           float64
Width (cm)            float64
Seat Height (cm)      float64
Duration (sec.)       float64
dtype: object

In [4]:
df.shape

(152487, 30)

## Renaming columns

In [5]:
df1=df.rename(columns={'Unnamed: 0':'Id'})

In [6]:
df1.columns

Index(['Id', 'Title', 'Artist', 'ConstituentID', 'ArtistBio', 'Nationality',
       'BeginDate', 'EndDate', 'Gender', 'Date', 'Medium', 'Dimensions',
       'CreditLine', 'AccessionNumber', 'Classification', 'Department',
       'DateAcquired', 'Cataloged', 'ObjectID', 'URL', 'ThumbnailURL',
       'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)', 'Height (cm)',
       'Length (cm)', 'Weight (kg)', 'Width (cm)', 'Seat Height (cm)',
       'Duration (sec.)'],
      dtype='object')

## Drop empty and useless columns

In [7]:
null_col=df1.isna().sum()
null_col_percent=round(null_col[null_col>0]/df1.shape[0]*100,2)
null_col_percent

Title                   0.03
Artist                 12.20
ConstituentID          12.20
ArtistBio              14.81
Nationality            12.20
BeginDate              12.20
EndDate                12.20
Gender                 12.20
Date                   12.84
Medium                 18.64
Dimensions             18.40
CreditLine             13.14
AccessionNumber        11.26
Classification         11.26
Department             11.26
DateAcquired           15.62
Cataloged              11.26
ObjectID               11.26
URL                    49.17
ThumbnailURL           56.12
Circumference (cm)     99.99
Depth (cm)             91.45
Diameter (cm)          99.07
Height (cm)            23.75
Length (cm)            99.52
Weight (kg)            99.81
Width (cm)             24.34
Seat Height (cm)      100.00
Duration (sec.)        97.93
dtype: float64

In [8]:
drop_cols=null_col_percent[null_col_percent>50].index
df2=df1.drop(drop_cols,axis=1)

In [9]:
df2.shape

(152487, 22)
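The thresholding step above can be wrapped in a small helper. This is a minimal sketch on toy data; the function name `drop_sparse_columns` and its `threshold` parameter are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, threshold=50.0):
    """Drop columns whose share of missing values (in %) exceeds `threshold`."""
    null_percent = frame.isna().mean() * 100
    sparse_cols = null_percent[null_percent > threshold].index
    return frame.drop(columns=sparse_cols)

# Toy data: 'Weight' is 75% missing, 'Height' is 25% missing
toy = pd.DataFrame({
    'Title': ['a', 'b', 'c', 'd'],
    'Height': [48.6, np.nan, 34.3, 50.8],
    'Weight': [np.nan, np.nan, np.nan, 1.2],
})
trimmed = drop_sparse_columns(toy)
print(trimmed.columns.tolist())
```

Using `isna().mean()` gives the missing fraction directly, so there is no need to divide by the row count.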

## Remove Duplicates

In [10]:
df3=df2.copy()
df3.duplicated().sum()

17168

In [11]:
print(df3.shape)
df4=df3.drop_duplicates()
print(df4.shape)

(152487, 22)
(135319, 22)
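`duplicated()` compares whole rows by default, so a column that is unique per row would hide records that are otherwise identical. Whether that matters for the MoMA file is not checked here; the sketch below only illustrates the `subset` argument on invented toy data.

```python
import pandas as pd

toy = pd.DataFrame({
    'Id': [0, 1, 2, 3],
    'Title': ['Untitled', 'Untitled', 'Villa', 'Villa'],
    'Artist': ['Glen Alps', 'Glen Alps', 'Emil Hoppe', 'Otto Wagner'],
})

# Whole-row comparison: the unique Id makes every row distinct
full_row_dupes = toy.duplicated().sum()

# Comparing only the descriptive columns finds the repeated record
content_cols = ['Title', 'Artist']
content_dupes = toy.duplicated(subset=content_cols).sum()

# Keep the first occurrence of each (Title, Artist) pair
deduped = toy.drop_duplicates(subset=content_cols, keep='first')
print(full_row_dupes, content_dupes, deduped.shape)
```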


## Remove parentheses from text in relevant columns

In [12]:
parenthesis_col=['ArtistBio','Nationality','BeginDate','EndDate','Gender','Date']
parenthesis_col

df5=df4.copy()

In [13]:
for col in parenthesis_col:
    # regex=True is required in recent pandas for the pattern to be read as a regex
    df5[col]=df5[col].str.replace(r'[()]','',regex=True)
    
df5.head()

Unnamed: 0,Id,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,...,CreditLine,AccessionNumber,Classification,Department,DateAcquired,Cataloged,ObjectID,URL,Height (cm),Width (cm)
0,0,"Ferdinandsbrücke Project, Vienna, Austria, Ele...",Otto Wagner,6210,"Austrian, 1841–1918",Austrian,1841,1918,Male,1896,...,Fractional and promised gift of Jo Carole and ...,885.1996,Architecture,Architecture & Design,1996-04-09,Y,2,http://www.moma.org/collection/works/2,48.6,168.9
1,1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"French, born 1944",French,1944,0,Male,1987,...,Gift of the architect in honor of Lily Auchinc...,1.1995,Architecture,Architecture & Design,1995-01-17,Y,3,http://www.moma.org/collection/works/3,40.6401,29.8451
2,2,"Villa near Vienna Project, Outside Vienna, Aus...",Emil Hoppe,7605,"Austrian, 1876–1957",Austrian,1876,1957,Male,1903,...,Gift of Jo Carole and Ronald S. Lauder,1.1997,Architecture,Architecture & Design,1997-01-15,Y,4,http://www.moma.org/collection/works/4,34.3,31.8
3,3,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"French and Swiss, born Switzerland 1944",,1944,0,Male,1980,...,Purchase and partial gift of the architect in ...,2.1995,Architecture,Architecture & Design,1995-01-17,Y,5,http://www.moma.org/collection/works/5,50.8,50.8
4,4,"Villa, project, outside Vienna, Austria, Exter...",Emil Hoppe,7605,"Austrian, 1876–1957",Austrian,1876,1957,Male,1903,...,Gift of Jo Carole and Ronald S. Lauder,2.1997,Architecture,Architecture & Design,1997-01-15,Y,6,http://www.moma.org/collection/works/6,38.4,19.1
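A minimal sketch of stripping parentheses from a text column, on toy values. `Series.str.replace` only interprets the pattern as a regular expression when `regex=True` is passed (literal matching became the default in pandas 2.0), so the flag is given explicitly here.

```python
import pandas as pd

toy = pd.Series(['(Austrian)', '(1841)', '()', None])

# Character class [()] matches either parenthesis anywhere in the string
cleaned = toy.str.replace(r'[()]', '', regex=True)

# Alternative that only removes parentheses at the ends of the string
stripped = toy.str.strip('()')
print(cleaned.tolist())
```

`str.strip('()')` is enough when the parentheses only wrap the whole value, as in `Nationality` or `Gender`; the regex version also removes parentheses in the middle of a string.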


In [15]:
df5.iloc[:,:].duplicated().sum()

0

## Clean Date values

In [16]:
df6=df5.copy()
print(df6.Date.unique())
start_values=df6.Date.nunique()
print("total unique values in date: ",start_values)

['1896' '1987' '1903' ... '1961-1962' 'early 1980s' '1979–1983']
total unique values in date:  8815


In [17]:
df6.Date=df6.Date.astype(str)

In [18]:
def test(date):
    count_not_str=0
    if type(date) != str:
        count_not_str+=1
    return count_not_str

# Check that the conversion worked: count the values that are not strings
count_type=df6.Date.apply(test).value_counts()
count_type

0    135319
Name: Date, dtype: int64

In [19]:
df6[df6.Date.str.contains('March 30')]

Unnamed: 0,Id,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,...,CreditLine,AccessionNumber,Classification,Department,DateAcquired,Cataloged,ObjectID,URL,Height (cm),Width (cm)
55432,55432,U.S. Removes Worker Files,Neal Boenzi/The New York Times,8512,"American, born 1925",American,1925,0,Male,"March 30, 1956",...,The New York Times Collection,1956.2001,Photograph,Photography,2001-06-14,Y,58549,http://www.moma.org/collection/works/58549,19.4,24.1
68345,68345,Untitled,Glen Alps,124,"American, 1914–1996",American,1914,1996,Male,March 30-31 1961,...,"Gift of Kleiner, Bell & Co.",681.1967,Print,Prints & Illustrated Books,1967-12-13,Y,73091,http://www.moma.org/collection/works/73091,67.3,47.3
69571,69571,TI QUEEN,Jerome Kaplan,2993,"American, born 1920",American,1920,0,Male,March 30-April 10 1962,...,"Gift of Kleiner, Bell & Co.",870.1967,Print,Prints & Illustrated Books,1967-12-13,N,74529,,75.8,56.7
105456,88287,Untitled from the Museum in Progress project P...,Jan Knap,30960,"Czech, born 1949",Czech,1949,0,Male,newspaper published March 30,...,Linda Barth Goldstein Fund,514.2006.24,Print,Prints & Illustrated Books,2006-06-01,N,103623,,47.0,31.5
122319,105150,"Die Aktion, vol. 7, no. 13",Ottheinrich Strohmeyer,41158,,,0,0,,"March 30, 1917",...,Committee on Prints and Illustrated Books Fund...,947.2010.204,Periodical,Prints & Illustrated Books,2010-11-10,Y,144409,http://www.moma.org/collection/works/144409,30.8,23.2
137632,120463,"Le Mirliton, no. 13",Théophile-Alexandre Steinlen,5634,"French, 1859–1923",French,1859,1923,Male,"March 30, 1894",...,Grace M. Mayer Bequest,595.1997.155,Periodical,Prints & Illustrated Books,,Y,183756,http://www.moma.org/collection/works/183756,37.5,27.5


In [20]:
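def clean_date(date):
    # Year at the end of the string, e.g. 'March 30, 1956' or '1961-1962'
    if re.search(r'[0-9]{4}$', date):
        return date[-4:]
    # Year at the start of the string
    elif re.search(r'^[0-9]{4}', date):
        return date[:4]
    # Year elsewhere in the string: keep the first 4-digit run
    elif re.search(r'[0-9]{4}', date):
        pos = re.search(r'[0-9]{4}', date).start()
        return date[pos:pos+4]
    # Uncertain last digit, e.g. '196?': replace '?' with '0'
    elif re.search(r'[0-9]{3}\?', date):
        new_date = re.sub(r'\?','0',date)
        pos = re.search(r'[0-9]{4}', new_date).start()
        return new_date[pos:pos+4]
    # Text without any digit carries no year
    elif re.search(r'^[a-zA-Z ,?.]+$', date):
        return np.nan
    # Centuries: leading digits give the century, e.g. '8th-9th century' -> '800'
    elif re.search('century',date):
        century = re.match(r'[0-9]+', date)
        return century.group()+'00' if century else np.nan
    else:
        return date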
    
    
# Testing function
date='8th-9th century C.E.'
new_date=clean_date(date)
print(new_date)

800
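As a simplified, vectorized alternative to a row-by-row function, `Series.str.extract` can pull the first four-digit year out of each string. This is a sketch under stated assumptions, not a replacement for `clean_date`: it keeps the first year of a range (`'1961-1962'` gives `'1961'`, where `clean_date` returns `'1962'`) and it returns a missing value for century-style dates. The sample strings are taken from values shown in this notebook.

```python
import pandas as pd

dates = pd.Series(['1896', '1961-1962', 'early 1980s', 'March 30, 1956',
                   'newspaper published March 30', '8th-9th century C.E.'])

# First run of exactly four digits that is not part of a longer number;
# strings without such a run yield NaN
first_year = dates.str.extract(r'(?<!\d)(\d{4})(?!\d)', expand=False)
print(first_year.tolist())
```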


In [21]:
df8=df6.copy()
before_clean2=df8.Date.nunique()
print("total unique values in date before clean 2: ", before_clean2)

df8.Date=df8.Date.apply(clean_date)

print(df8.Date.value_counts())
print(df8.Date.unique())
clean2_values=df8.Date.nunique()
print("total unique values in date after clean 2: ", clean2_values)

total unique values in date before clean 2:  8816
1966    2538
1967    2472
1969    2405
1965    2386
1968    2229
        ... 
1809       1
1600       1
1805       1
1848       1
1635       1
Name: Date, Length: 204, dtype: int64
['1896' '1987' '1903' '1980' '1976' '1968' '1900' '1978' '1905' '1906'
 '1979' '1918' '1970' '1975' '1984' '1986' '1974' nan '1917' '1923' '1930'
 '1936' '1935' '1937' '1938' '1977' '1958' '1985' '1989' '1949' '1964'
 '1991' '1941' '1965' '1981' '1983' '1988' '1992' '1915' '1953' '1910'
 '1982' '1945' '1924' '1990' '1995' '1931' '1929' '1959' '1920' '1939'
 '1993' '1996' '1952' '1921' '1957' '1972' '1956' '1962' '1925' '1960'
 '1969' '1963' '1994' '1961' '1928' '1927' '1933' '1967' '1934' '1940'
 '1946' '1955' '1997' '1922' '1942' '1954' '1973' '1926' '1932' '1947'
 '1943' '1944' '1966' '1971' '1999' '1913' '1951' '2002' '2001' '2000'
 '1886' '1950' '1901' '1948' '1912' '1908' '1902' '1904' '1916' '1998'
 '1914' '1875' '1898' '1909' '1907' '800' '700' '1600' 

In [None]:
## Not used: first approach tried for cleaning the date values
df6.loc[:,'Date']=df6.loc[:,'Date'].str.replace("'",'').str.replace('.','').str.replace('early','').str.replace('s','').str.replace('c.','').str.replace('After','').str.replace('or before','').str.replace(' publihed','').str.replace('printed ','').str.replace('newpaperSeptember','').str.replace('exeted','').str.replace('Before','').str.replace(' ','')
print(df6.Date.unique())
clean1_values=df6.Date.nunique()
print("total unique values in date after clean 1: ", clean1_values)


In [22]:
## Manually cleaning inconsistent data
df9=df8.copy()
cel=df9[(df9.Date=='November 10')&(df9.Artist=='George Platt Lynes')]
df9.loc[cel.index,'Date']='1937'

In [23]:
## Manually cleaning inconsistent data
cel2=df9[(df9.Date=='newspaper published March 30')]
df9.loc[cel2.index,'Date']=np.nan
df9.loc[(df9.Artist=='Jan Knap')]

Unnamed: 0,Id,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,...,CreditLine,AccessionNumber,Classification,Department,DateAcquired,Cataloged,ObjectID,URL,Height (cm),Width (cm)
105456,88287,Untitled from the Museum in Progress project P...,Jan Knap,30960,"Czech, born 1949",Czech,1949,0,Male,,...,Linda Barth Goldstein Fund,514.2006.24,Print,Prints & Illustrated Books,2006-06-01,N,103623,,47.0,31.5


In [24]:
print(df9.Date.unique())
clean3_values=df9.Date.nunique()
print("total unique values in date after clean 3: ", clean3_values)

['1896' '1987' '1903' '1980' '1976' '1968' '1900' '1978' '1905' '1906'
 '1979' '1918' '1970' '1975' '1984' '1986' '1974' nan '1917' '1923' '1930'
 '1936' '1935' '1937' '1938' '1977' '1958' '1985' '1989' '1949' '1964'
 '1991' '1941' '1965' '1981' '1983' '1988' '1992' '1915' '1953' '1910'
 '1982' '1945' '1924' '1990' '1995' '1931' '1929' '1959' '1920' '1939'
 '1993' '1996' '1952' '1921' '1957' '1972' '1956' '1962' '1925' '1960'
 '1969' '1963' '1994' '1961' '1928' '1927' '1933' '1967' '1934' '1940'
 '1946' '1955' '1997' '1922' '1942' '1954' '1973' '1926' '1932' '1947'
 '1943' '1944' '1966' '1971' '1999' '1913' '1951' '2002' '2001' '2000'
 '1886' '1950' '1901' '1948' '1912' '1908' '1902' '1904' '1916' '1998'
 '1914' '1875' '1898' '1909' '1907' '800' '700' '1600' '1897' '1895'
 '1880' '1885' '1768' '1878' '1808' '1865' '1899' '1876' '1873' '1860'
 '1866' '1830' '1840' '1919' '1884' '1883' '1894' '1893' '1879' '1892'
 '1890' '1877' '1911' '1891' '1889' '1818' '1852' '1837' '1828' '1854'
 '17

## Guess Missing Date Values

In [26]:
df9.Date.isna().sum()

3514

In [136]:
df10=df9.copy()

null_date=df10.loc[df10.Date.isna()].index
df10.drop(null_date,axis=0,inplace=True)

In [137]:
null_artist=df10[df10.Artist.isna()]
null_artist.Date.isna().sum()

0

In [138]:
df10.Date=df10.Date.astype(int)
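`astype(int)` works here only because the rows with missing dates were dropped first. As a hedged alternative, `pd.to_numeric` with `errors='coerce'` combined with pandas' nullable `Int64` dtype can hold integer years and missing values in the same column. The toy values below are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.Series(['1896', '800', np.nan, 'November 10'])

# Unparseable strings become NaN instead of raising ValueError
numeric = pd.to_numeric(dates, errors='coerce')

# Nullable integer dtype: whole numbers plus <NA> for missing entries
years = numeric.astype('Int64')
print(years)
```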

In [335]:
mean_date=round(df10.groupby('ConstituentID')['Date'].agg('mean'))
mean_date=mean_date.astype(int)
mean_date.loc['27']  # sanity check: mean year for one artist
mean_date.shape

(13980,)

In [151]:
df9[df9.Date.isna()]

Unnamed: 0,Id,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,...,CreditLine,AccessionNumber,Classification,Department,DateAcquired,Cataloged,ObjectID,URL,Height (cm),Width (cm)
77,77,Misc. objects,Ludwig Mies van der Rohe,7166,"American, born Germany. 1886–1969",American,1886,1969,Male,,...,"Mies van der Rohe Archive, gift of Ludwig Glaeser",29.1980.1-248,Mies van der Rohe Archive,Architecture & Design,1980-01-08,N,102,,,
88,88,"Skandia Cinema, Stockholm, Sweden, Perspective...",Erik Gunnar Asplund,27,"Swedish, 1885–1940",Swedish,1885,1940,Male,,...,"Gift of Ira Levy, Mrs. Donald B. Marron, and p...",43.1990,Architecture,Architecture & Design,1990-01-17,Y,126,http://www.moma.org/collection/works/126,33.000000,29.800000
89,89,"Skandia Cinema, Stockholm, Sweden, Perspective...",Erik Gunnar Asplund,27,"Swedish, 1885–1940",Swedish,1885,1940,Male,,...,"Gift of Ira Levy, Mrs. Donald B. Marron, and p...",44.1990,Architecture,Architecture & Design,1990-01-17,Y,128,http://www.moma.org/collection/works/128,21.000000,26.700000
90,90,"Public Library, Stockholm, Sweden, Elevation o...",Erik Gunnar Asplund,27,"Swedish, 1885–1940",Swedish,1885,1940,Male,,...,Gift of Marshall Cogan and purchase,45.1990,Architecture,Architecture & Design,1990-01-17,Y,130,http://www.moma.org/collection/works/130,91.400000,93.300000
91,91,"Public Library, Stockholm, Sweden, Elevation o...",Erik Gunnar Asplund,27,"Swedish, 1885–1940",Swedish,1885,1940,Male,,...,Gift of Marshall Cogan and purchase,46.1990,Architecture,Architecture & Design,1990-01-17,Y,131,http://www.moma.org/collection/works/131,90.800000,93.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152129,134960,Earth Run score,Terry Riley,4930,"American, born 1935",American,1935,0,Male,,...,Gift of Kourosh Larizadeh and Luis Pardo,SC832.2017.14,Ephemera,Media and Performance Art,2017-12-31,N,288404,,21.590043,27.940056
152130,134961,Terry Riley score on sheet music,Terry Riley,4930,"American, born 1935",American,1935,0,Male,,...,Gift of Kourosh Larizadeh and Luis Pardo,SC832.2017.19,Ephemera,Media and Performance Art,2017-12-31,N,288405,,21.590043,27.940056
152481,135312,Lost Portraits (Individual),,,,,,,,,...,,F2016.86.1,Film,Film,,N,290212,,,
152482,135313,Lost Portraits (Double),,,,,,,,,...,,F2016.86.2,Film,Film,,N,290213,,,


In [239]:
mean_date.index

Index(['1', '10', '100', '1000', '1001', '10016', '10027', '1003', '10034',
       '1004',
       ...
       '9951', '9953', '9954', '9958', '996', '9971', '9972', '9973', '998',
       '999'],
      dtype='object', name='ConstituentID', length=13980)

In [242]:
test_date

Unnamed: 0_level_0,Date
ConstituentID,Unnamed: 1_level_1
27,1917.0
27,1917.0
27,1917.0
27,1917.0
27,1917.0
27,1917.0
27,1923.0
27,1923.0
27,1923.0
27,1923.0


In [339]:
def getmean(x):
    # x: ConstituentID as a string; must be present in mean_date's index
    return mean_date[mean_date.index==x][0]

# Testing the function on a single artist (ConstituentID '27')
test_tab=df9[['ConstituentID','Date']]
# .copy() so the assignment below does not trigger SettingWithCopyWarning
test_date=test_tab[test_tab.ConstituentID=='27'].copy()

test_date['Date']=test_date['Date'].fillna(test_date['ConstituentID'].apply(getmean))
test_date

Unnamed: 0,ConstituentID,Date
78,27,1917
79,27,1917
80,27,1917
81,27,1917
82,27,1917
83,27,1917
84,27,1923
85,27,1923
86,27,1923
87,27,1923


In [340]:
test_date_bis=test_tab[test_tab.ConstituentID.isin(['27','4930'])]
test_date_bis

Unnamed: 0,ConstituentID,Date
78,27,1917.0
79,27,1917.0
80,27,1917.0
81,27,1917.0
82,27,1917.0
83,27,1917.0
84,27,1923.0
85,27,1923.0
86,27,1923.0
87,27,1923.0


In [341]:
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning
test_date_bis=test_date_bis.assign(Date=test_date_bis.Date.fillna(test_date_bis['ConstituentID'].apply(getmean)))
test_date_bis

Unnamed: 0,ConstituentID,Date
78,27,1917
79,27,1917
80,27,1917
81,27,1917
82,27,1917
83,27,1917
84,27,1923
85,27,1923
86,27,1923
87,27,1923


In [338]:
df11=df9.copy()

df11.Date=df11.Date.fillna(df11['ConstituentID'].apply(getmean))

IndexError: index out of bounds
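The `IndexError` most likely comes from `ConstituentID` values that are absent from the index of `mean_date` (missing IDs, or artists none of whose works has a date): the boolean filter in `getmean` then returns an empty Series and `[0]` has nothing to return. A minimal sketch of one way around this, on toy data: `Series.map` looks each ID up in the table of means and yields NaN when there is no match, so `fillna` leaves those rows missing instead of failing. Column names follow the notebook; the toy values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'ConstituentID': ['27', '27', '27', '4930', np.nan],
    'Date': [1917, 1923, np.nan, np.nan, np.nan],
})

# Mean year per artist, computed from the rows that have a date
mean_date = toy.groupby('ConstituentID')['Date'].mean().round()

# map() returns NaN for IDs without a usable mean instead of raising
fallback = toy['ConstituentID'].map(mean_date)

# Fill only the missing dates; rows without a fallback stay missing (<NA>)
toy['Date'] = toy['Date'].fillna(fallback).astype('Int64')
print(toy)
```

Rows that still have no date after this step (no artist, or no dated work by the artist) need a different strategy or have to be dropped.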

In [345]:
mean_date.describe()

count    13980.000000
mean      1964.796853
std         33.057411
min       1768.000000
25%       1945.000000
50%       1968.000000
75%       1990.000000
max       2017.000000
Name: Date, dtype: float64

In [347]:
len(mean_date)

13980

In [349]:
df11[['Date','ConstituentID']]

Unnamed: 0,Date,ConstituentID
0,1896,6210
1,1987,7470
2,1903,7605
3,1980,7056
4,1903,7605
...,...,...
152482,,
152483,,
152484,1976,3402
152485,1973,3402
