# Analysis of museum collection

## Dataset and Motivation Slide

    - Our dataset is the Metropolitan Museum of Art collection, provided in CSV format.
    - We collected the data by downloading the file from https://github.com/metmuseum/openaccess/.
    - The metadata was messy, with many problems: values placed in the wrong columns, missing values, missing documentation, inconsistent information, possible duplicates, and mixed text and numeric data.

## Actual Task Definition / Research Question

    - The real-world questions we are interested in are the relation between departments and collections, the types of objects in the collection, and the relation between years and the museum's collecting activities.
        
### The relation between department and collections
    - The input is the index and values of the department counts derived from the Department attribute.
    - The output is a plot of the number of occurrences against the department names, i.e. a catalog of the museum's collections.

### The types of the object
    - The input is the index and values of the object-type counts from the Object_Name attribute, restricted to the top ten.
    - The output is a plot of the top ten object types in the museum.

### The relation between years and museum collection activities
    - The input is the year extracted from the Credit_Line attribute; the frequency of each year is counted in a Series.
    - The output is a plot showing, over the period covered by the museum's records, in which years the museum acquired art from outside.

## Literature Reviews


### Import all the modules required

In [None]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from collections import Counter, defaultdict
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from math import inf

### Import the data from the website

In [None]:
collection_df = pd.read_csv("https://media.githubusercontent.com/media/metmuseum/openaccess/master/MetObjects.csv")

## Data cleaning

The data now looks fine. We intend to do the following data cleaning work:
+ Check "Object Begin Date" and "Object End Date" against "Object Date"
+ Check "Artist Begin Date" and "Artist End Date" against "Artist Display Bio"
+ Check "Artist Display Name" against "Artist Alpha Sort", then delete the latter.
+ Merge "Artist Role" with "Artist Prefix"
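
As a minimal sketch of the first check in the list above, the snippet below flags rows whose begin date is later than the end date. It only tests the ordering of the two numeric columns and does not parse the free-text "Object Date". The rows and the names `toy_df`, `Date_Order_OK` and `suspicious_rows` are invented for illustration.

```python
import pandas as pd

# Toy rows standing in for the real collection (values are invented).
toy_df = pd.DataFrame({
    "Object Date": ["1853", "ca. 1900", "1700-1650"],
    "Object Begin Date": [1853, 1895, 1700],
    "Object End Date": [1853, 1905, 1650],
})

# A begin date later than the end date cannot be right, so flag it for review.
toy_df["Date_Order_OK"] = toy_df["Object Begin Date"] <= toy_df["Object End Date"]
suspicious_rows = toy_df[~toy_df["Date_Order_OK"]]
print(suspicious_rows)
```

Rows flagged this way would still need a manual comparison with the text in "Object Date".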


In [None]:
## Import all the helper functions we need to use.
import data_clean_functions as dcf

Some of the original column names contain spaces, which makes them awkward to reference. So we replace the spaces with underscores for easier indexing.

In [None]:
collection_df.columns = ['_'.join(col_name.split()) for col_name in collection_df]
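
To show the effect of this renaming in isolation, here is a small sketch on an empty toy frame. `toy_df` is a hypothetical name; the three column names are taken from the columns mentioned in this notebook.

```python
import pandas as pd

# Empty toy frame with column names that contain spaces.
toy_df = pd.DataFrame(columns=["Object Name", "Artist Display Name", "Department"])

# Iterating over a DataFrame yields its column names; split on whitespace, join with '_'.
toy_df.columns = ['_'.join(col_name.split()) for col_name in toy_df]
print(list(toy_df.columns))

# After renaming, attribute access works.
object_names = toy_df.Object_Name
```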

### First look at the data

In [None]:
collection_df.dtypes

We first check the number of NaNs in every column, then ignore the columns in which 10% or more of the values are NaN, keeping only the columns below that threshold.

In [None]:
collection_df.isna().sum()

In [None]:
filtered_df =  collection_df.loc[:, collection_df.isna().sum(axis = 0) < len(collection_df.index)*0.1 ]
filtered_df
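
The same threshold can also be expressed as a share of missing values per column. The sketch below uses an invented ten-row frame; `toy_df`, `nan_share` and `kept_df` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Ten toy rows: 'Dynasty' is missing in nine of them.
toy_df = pd.DataFrame({
    "Object_ID": range(10),
    "Department": ["Drawings and Prints"] * 10,
    "Dynasty": [np.nan] * 9 + ["Dynasty 18"],
})

# isna().mean() gives the fraction of NaNs per column.
nan_share = toy_df.isna().mean()

# Keep only the columns with less than 10% missing values.
kept_df = toy_df.loc[:, nan_share < 0.1]
print(list(kept_df.columns))
```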

### Check the number of repositories
We already know that the Metropolitan Museum of Art in New York has only one repository, so we check this column to catch possible spelling mistakes.

In [None]:
collection_df.Repository.value_counts()

Only one distinct value appears in the index. This meets our expectation.
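
Had the check turned up several values, one possible follow-up would be to normalise case and whitespace before counting, so that only genuine spelling differences remain. The strings below are invented, and `repo` and `normalized` are hypothetical names.

```python
import pandas as pd

# Invented variants of one repository name.
repo = pd.Series([
    "Metropolitan Museum of Art, New York, NY",
    "Metropolitan Museum of Art, New York, NY ",
    "metropolitan museum  of art, new york, ny",
])

# Trim, lower-case and collapse repeated whitespace.
normalized = repo.str.strip().str.lower().str.replace(r'\s+', ' ', regex=True)

raw_unique = repo.nunique()
normalized_unique = normalized.nunique()
print(raw_unique, normalized_unique)
```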

### Extract the year information from the Credit_Line column
Extract the credit year from the Credit_Line column and store it in a new column called 'Credit_Year' with a timestamp data type.

In [None]:
def credit_year(row):
    if pd.notna(row):
        ## find the year in the last comma-separated part of the credit line
        year_str = re.findall(r'\d{4}', row.split(',')[-1])
        if year_str:
            year_int = int(year_str[0])
            ## keep only years between the founding of the museum (1870) and 2019
            if 1870 <= year_int <= 2019:
                return pd.Timestamp(str(year_int))
    return np.nan
collection_df['Credit_Year'] = collection_df.Credit_Line.apply(credit_year)
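
The same extraction could also be sketched with vectorised string methods instead of a row-wise `apply`. This is an alternative under the same assumptions (first four-digit number in the last comma-separated part, years limited to 1870 to 2019), not the method used above. The credit lines are invented, and `sample`, `last_part`, `year_str`, `valid` and `credit_year_sketch` are hypothetical names.

```python
import pandas as pd

# Invented credit lines covering the main cases.
sample = pd.Series([
    "Rogers Fund, 1917",
    "Gift of the artist, 1985",
    "Purchase, Anonymous Gift",   # no year at all
    None,                          # missing credit line
    "Bequest, 1850",              # year before the museum was founded
])

# Last comma-separated part, then the first four-digit number in it.
last_part = sample.str.split(',').str[-1]
year_str = last_part.str.extract(r'(\d{4})')[0]

# Keep only plausible years and convert them to timestamps.
valid = pd.to_numeric(year_str, errors='coerce').between(1870, 2019)
credit_year_sketch = pd.to_datetime(year_str.where(valid), format='%Y', errors='coerce')
print(credit_year_sketch)
```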

### A light check of Artist Display Name
The Artist_Display_Name column should correspond to the Artist_Alpha_Sort column. The two columns should hold the same content, apart from abbreviations, alphabetical ordering, and omitted titles.

We compare the tokenized Artist_Display_Name and Artist_Alpha_Sort columns by the share of tokens they have in common, and set the acceptance threshold to roughly 60% to allow for abbreviations and titles. Names that fall below this threshold are likely to contain an actual spelling mistake.

In [None]:
def check_name(row):
    ## nothing to compare against, so accept the row
    if pd.isna(row.Artist_Alpha_Sort):
        return True
    r1 = re.findall(r'\w+', row.Artist_Display_Name.lower())
    r2 = re.findall(r'\w+', row.Artist_Alpha_Sort.lower())
    ## a name without any word characters cannot be matched; flag it and avoid dividing by zero
    if not r1 or not r2:
        return False
    ## share of display-name tokens found in the alpha-sort name
    if sum(word in r2 for word in r1) / len(r1) > 0.6:
        return True
    ## share of alpha-sort tokens found in the display name
    return sum(word in r1 for word in r2) / len(r2) > 0.6
collection_df['Artist_Name_Check'] = collection_df[collection_df.Artist_Display_Name.notna()].apply(check_name, axis=1)
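
To illustrate why a threshold of about 60% tolerates reordering and abbreviations, the sketch below computes the token overlap for three name pairs. `token_overlap` is a hypothetical helper written for this example, and the name pairs are illustrative rather than taken from the dataset.

```python
import re

def token_overlap(name_a, name_b):
    """Share of word tokens in name_a that also appear in name_b."""
    tokens_a = re.findall(r'\w+', name_a.lower())
    tokens_b = set(re.findall(r'\w+', name_b.lower()))
    if not tokens_a:
        return 0.0
    return sum(token in tokens_b for token in tokens_a) / len(tokens_a)

# Same tokens in a different order: full overlap.
same_artist = token_overlap("Vincent van Gogh", "Gogh, Vincent van")
# One abbreviated token out of three: two thirds overlap, still above 0.6.
abbreviated = token_overlap("Louis C. Tiffany", "Tiffany, Louis Comfort")
# Unrelated names: no overlap.
different_artist = token_overlap("Vincent van Gogh", "Monet, Claude")
print(same_artist, abbreviated, different_artist)
```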

### Check the Artist_Role column
We intend to check the Artist_Role column for spelling mistakes and merge the Artist_Prefix column into it. This should give us an idea of the most popular roles in history.

In [None]:
## expanding contractions
contraction_patterns=[(r'can\'t', 'cannot'),
                    (r'haven\'t', 'have not'),
                    (r'(\w+)\'ll', r'\g<1> will'),
                    (r'(\w+)\'re', r'\g<1> are')]

class contraction_replacer(object):
    def __init__(self, contraction_patterns):        
        # store compiled regex object
        self._contraction_regexes = [(re.compile(p), replaced_text) for p, replaced_text in contraction_patterns]
        
    def do_contraction_normalization(self, text):
        for contraction_regex, replaced_text in self._contraction_regexes:
            text = contraction_regex.sub(replaced_text,text)
        return text

def get_noun(row):
    sample_contraction_replacer = contraction_replacer(contraction_patterns)
    ## word tokenize
    if pd.isna(row):
        return row
    else:
        text = re.sub(r'\|',' ',row)
    words = nltk.tokenize.word_tokenize(sample_contraction_replacer.do_contraction_normalization(text))
    words = set(words)

    ## stop words removal
    stopwords = nltk.corpus.stopwords.words('english')
    words = [w for w in words if w not in stopwords]

    ## lemmatization
    wnetl = WordNetLemmatizer()

    for i in range(len(words)):
        if not wordnet.synsets(words[i]):
            nword = wnetl.lemmatize(words[i], 'n')
            if wordnet.synsets(nword):
                words[i] = nword
    return '|'.join(words)

collection_df.Artist_Role = collection_df.Artist_Role.apply(get_noun)

## Data analysis & Data Visualization

### Explore the general information of the data set
Check the percentage of highlighted collections from the Is_Highlight column and the percentage of public domain collections from the Is_Public_Domain column.

In [None]:
## share of True values; the label lookup assumes both columns were read as booleans
highlight_se = collection_df.Is_Highlight.value_counts()
quantile_highlight = highlight_se[True]/highlight_se.sum()
public_se = collection_df.Is_Public_Domain.value_counts()
quantile_public = public_se[True]/public_se.sum()
print(f'The percentage of highlighted collections is {quantile_highlight * 100:.2f}%')
print(f'The percentage of public domain collections is {quantile_public * 100:.2f}%')

The results show that highlighted collections make up no more than 0.40% of the total, and that no more than half of the collections are in the public domain.
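
For a boolean column, the share of `True` values is simply the mean of the column, which gives a shorter route to the same percentage. The sketch below assumes a boolean flag column and uses invented values; `flags` and `true_share` are hypothetical names.

```python
import pandas as pd

# Invented flag column: two of eight objects are marked True.
flags = pd.Series([True, False, False, False, True, False, False, False])

# True counts as 1 and False as 0, so the mean is the share of True values.
true_share = flags.mean()
print(f'{true_share * 100:.2f}%')
```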

### Explore the possible spelling mistakes in Artist_Display_Name
This is a very rough check for possible spelling mistakes in the Artist_Display_Name column. Given the criteria set in the data cleaning step, only a few possible spelling mistakes remain, and these require a human check.

In [None]:
print(collection_df.Artist_Name_Check.value_counts(dropna=False))
collection_df[['Artist_Name_Check','Artist_Display_Name','Artist_Alpha_Sort']].sample(10)

### Explore the types of collections
The Object_Name column shows the type of each object. We are interested in finding the most common object types in the museum collection.

In [None]:
top_ten_collections = collection_df.Object_Name.value_counts(dropna=False).head(10)
top_ten_collections

In [None]:
plt.figure(figsize=(15,5))
sns.barplot(x = top_ten_collections.values, y = top_ten_collections.index, alpha=0.8, orient='h')
plt.title('Top Ten Collection Types in The Museum')
plt.xlabel('Number of Occurrences', fontsize=12)
plt.ylabel('Collection Types', fontsize=12)
plt.show()

### Explore the department
The Department column shows the department to which each object belongs. We are interested in taking a look at the collections of each department.

In [None]:
## explore the department
departments = collection_df.Department.value_counts()
departments

In [None]:
plt.figure(figsize=(15,5))
sns.barplot(x = departments.values, y = departments.index, alpha=0.8, orient='h')
plt.title('Quantiles of Collections in Different Departments')
plt.xlabel('Number of Occurrences', fontsize=12)
plt.ylabel('Names of Departments', fontsize=12)
plt.show()

### Explore the frequency of credit line
The Credit_Year column we built in the data cleaning section shows the year in which each object was acquired. We want to look at the frequency of credits to check whether it increases or decreases over time.

In [None]:
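## count credits per year and sort chronologically so the rolling window spans consecutive years
freq_df = collection_df.Credit_Year.value_counts().sort_index()
ax = freq_df.plot(figsize = (15,5), label = 'frequency')
freq_df.rolling(window = 30).mean().plot(ax = ax, color='green', label = '30-year rolling mean')
ax.legend()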
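
Because `value_counts()` orders its result by frequency rather than by year, a rolling mean is only meaningful once the counts are sorted chronologically. The toy sketch below shows this on six invented dates with a window of two; `years`, `yearly_counts` and `rolling_mean` are hypothetical names.

```python
import pandas as pd

# Six invented credit years.
years = pd.Series(pd.to_datetime(["1900", "1902", "1900", "1901", "1902", "1902"], format="%Y"))

# Count per year, then sort by year so neighbouring rows are neighbouring years.
yearly_counts = years.value_counts().sort_index()

# Two-year rolling mean over the chronologically ordered counts.
rolling_mean = yearly_counts.rolling(window=2).mean()
print(rolling_mean)
```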

### Explore the most popular artist role

In [None]:
def select_top_ten_roles(df):
    ## count every role; one cell may hold several roles separated by '|'
    roles = Counter()
    for cell in df.Artist_Role.dropna():
        roles.update(cell.split('|'))
    ## most_common keeps roles with tied counts, which a strict '<' comparison would skip
    top = roles.most_common(10)
    x = [count for _, count in top]
    y = [role for role, _ in top]
    return x, y
x, y = select_top_ten_roles(collection_df)
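
The role counting can also be sketched with pandas string methods alone, by splitting each cell on `|` and exploding the result into one row per role. The roles below are invented, and `toy_roles`, `role_counts` and `top_roles` are hypothetical names.

```python
import pandas as pd

# Invented role cells; several roles in one cell are separated by '|'.
toy_roles = pd.Series(["Artist|Publisher", "Artist", None, "Maker|Artist", "Publisher"])

# One row per role, then count and keep the ten most frequent.
role_counts = toy_roles.dropna().str.split('|').explode().value_counts()
top_roles = role_counts.head(10)
print(top_roles)
```

The values and index of `top_roles` could then be passed to the bar plot in the same way as the other counts in this notebook.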

In [None]:
plt.figure(figsize=(15,5))
sns.barplot(x = x, y = y, alpha=0.8, orient='h')
plt.title('Artist Role Count')
plt.xlabel('Number of Occurrences', fontsize=12)
plt.ylabel('Names of Roles', fontsize=12)
plt.show()

# Analysis conclusion


## Top Ten Collection Types in The Museum
    - From the plot we can clearly see that the top ten collection types are print, photograph, drawing, book, fragment, Kylix fragment, piece, painting, negative, and Baseball card/paint.
    - We use the values and index of the Series as the x-axis and y-axis data.

## Quantiles of Collections in Different Departments
    - From the plot we can easily compare the number of objects in each department, and we can see that the largest department is Drawings and Prints.
    - We again use the values and index of the Series as the x-axis and y-axis data.

## Frequency of Contributions by Year
    - From the plot we can state that the museum has kept acquiring art from outside over the years; although some years show more collecting activity than others, the overall level of activity does not change much.
    - We extract the year from each credit line, count how often each year appears, and plot the count against the year.
