#💬 titleInfo 

[title_1](#title_1)
- [get overall pass/fail scores](#t1_all)
- [look at failing scores by collection](#t1_scoresbycoll)

[title_2](#title_2)
- [get overall pass/fail scores](#t2_all)
- [get pass/fail scores by division](#t2_scoresbydiv)
- [look at failing scores by collection](#t2_scoresbycoll)
- [look at failing scores within a collection](#t2_scoresincoll)

[co-occurrences](#co)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('min-mandatory-score_2016-07-08.csv')

In [2]:
# inspect column names, dtypes, and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 965066 entries, 0 to 965065
Data columns (total 25 columns):
uuid              965066 non-null object
mms_id            965066 non-null int64
mms_type          965066 non-null object
coll_id           962212 non-null float64
division          965066 non-null object
title_1           965066 non-null int64
title_2           965066 non-null int64
title             965066 non-null float64
typeOfResource    965066 non-null int64
genre_1           965066 non-null int64
genre_2           965066 non-null int64
genre             965066 non-null float64
date_1            965066 non-null int64
date_2            965066 non-null int64
date_3            965066 non-null int64
date_4            965066 non-null int64
date              965066 non-null float64
identifier        965066 non-null int64
location_1        965066 non-null int64
location_2        965066 non-null int64
location_3        965066 non-null int64
location_4        965066 non-null int6

In [3]:
# check that basic completeness scores look OK
df[['title_1','typeOfResource', 'identifier', 'date_1', 'genre_1', 'location_1']].describe()

| | title_1 | typeOfResource | identifier | date_1 | genre_1 | location_1 |
|---|---|---|---|---|---|---|
| count | 965066.0 | 965066.0 | 965066.0 | 965066.0 | 965066.0 | 965066.0 |
| mean | 0.999995 | 0.981278 | 0.592315 | 0.598546 | 0.753671 | 0.917343 |
| std | 0.002276 | 0.135542 | 0.491404 | 0.490193 | 0.430873 | 0.275362 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 50% | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 75% | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
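
Because each score column holds only 0 or 1, the mean reported by `describe()` is the share of records that pass that check. The following is a minimal sketch of turning those means into percentages; the `toy` frame and its values are invented for illustration and are not taken from the CSV.

```python
import pandas as pd

# hypothetical 0/1 scores; values are invented for illustration
toy = pd.DataFrame({
    'title_1':    [1, 1, 1, 1],
    'identifier': [1, 0, 1, 0],
    'date_1':     [1, 1, 1, 0],
})

# mean of a 0/1 column = fraction of rows that pass
pass_rate = toy.mean()

# express as a percentage, rounded to one decimal place
pass_pct = (pass_rate * 100).round(1)
print(pass_pct)
```

Applied to `df`, the same two lines would show at a glance which checks (here `identifier` and `date_1`, at roughly 59% and 60%) fail most often.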


##title_1<a id="title_1"></a>
####assertion: presence of at least one title element
- [get overall pass/fail scores](#t1_all)
- [look at failing scores by collection](#t1_scoresbycoll)

####get overall pass/fail scores<a id="t1_all"></a>

In [18]:
# record has at least one title element
df.title_1.value_counts()

1    965061
0         5
dtype: int64

####look at failing scores by collection<a id="t1_scoresbycoll"></a>

In [19]:
# the failing set is small, so skip the by-division breakdown
# and list the failing title_1 records with their collection and division
df[df['title_1'] == 0.0][['mms_id', 'coll_id', 'division']]

| | mms_id | coll_id | division |
|---|---|---|---|
| 948956 | 5073280 | 44210 | Jerome Robbins Dance Division - Audio and Movi... |
| 949085 | 5073444 | 47312 | Jerome Robbins Dance Division - Audio and Movi... |
| 950287 | 5075118 | 47312 | Jerome Robbins Dance Division - Audio and Movi... |
| 957783 | 5087690 | 47614 | Billy Rose Theatre Division |
| 959229 | 5091610 | 48071 | Manuscripts and Archives Division |


##title_2<a id="title_2"></a>
####assertion: record has one title marked usage='primary' 
- [get overall pass/fail scores](#t2_all)
- [get pass/fail scores by division](#t2_scoresbydiv)
- [look at failing scores by collection](#t2_scoresbycoll)
- [look at failing scores within a collection](#t2_scoresincoll)

####get overall pass/fail scores<a id="t2_all"></a>

In [3]:
# record has one title marked usage='primary'
df.title_2.value_counts()

1    951214
0     13852
dtype: int64

####get pass/fail scores by division<a id="t2_scoresbydiv"></a>

In [21]:
# get number of records failing title_2
# records failing title_2 include those that failed title_1,
# so subtract the title_1 failures from the title_2 failures

# scores are 0/1, so len(x) = total rows and x.sum() = rows that passed
def count_failing(x):
    return len(x) - x.sum()

# group by division and count failing scores for each assertion
failed_title2 = df.groupby('division')[['title_1', 'title_2']].agg(count_failing)

# subtract title_1 failures from title_2 failures
difference = failed_title2['title_2'] - failed_title2['title_1']
# Series.order() was removed from pandas; sort_values() replaces it
difference.sort_values(ascending=False)

division
Billy Rose Theatre Division                                     13091
The Miriam and Ira D. Wallach Division of Art, Prints and Photographs: Print Collection      189
Manuscripts and Archives Division                                 142
Jerome Robbins Dance Division                                      90
Schomburg Center for Research in Black Culture, Photographs and Prints Division       65
Jerome Robbins Dance Division - Audio and Moving Image             52
Art and Picture Collection                                         50
Spencer Collection                                                 37
Music Division                                                     28
Rodgers and Hammerstein Archives of Recorded Sound                 24
Schomburg Center for Research in Black Culture, Manuscripts, Archives and Rare Books Division       23
Rare Book Division                                                 18
General Research Division                                          13
Sc
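
The subtraction above relies on every title_1 failure also being a title_2 failure. Under that assumption, the same per-division numbers can be obtained by selecting records that pass title_1 but fail title_2 and counting them directly. The sketch below compares both approaches on invented data; the `toy` frame and its division names are hypothetical.

```python
import pandas as pd

# hypothetical scores for two made-up divisions
toy = pd.DataFrame({
    'division': ['A', 'A', 'A', 'B', 'B'],
    'title_1':  [1, 1, 0, 1, 1],
    'title_2':  [1, 0, 0, 0, 1],
})

# subtraction approach: title_2 failures minus title_1 failures
fails = toy.groupby('division')[['title_1', 'title_2']].agg(lambda x: len(x) - x.sum())
by_subtraction = fails['title_2'] - fails['title_1']

# direct approach: records that have a title but no primary title
mask = (toy['title_1'] == 1) & (toy['title_2'] == 0)
by_mask = toy[mask].groupby('division').size()

print(by_subtraction)
print(by_mask)
```

One difference to keep in mind: the direct approach omits divisions with no such records, while the subtraction approach lists them with a count of 0.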

####look at failing scores by collection<a id="t2_scoresbycoll"></a>

In [22]:
# look at failing title_2 scores by collection
df[df['title_2'] == 0.0][['mms_id', 'coll_id', 'division']].sort_values('coll_id')

| | mms_id | coll_id | division |
|---|---|---|---|
| 821798 | 4887857 | 25780 | Billy Rose Theatre Division |
| 821843 | 4887904 | 25780 | Billy Rose Theatre Division |
| 821840 | 4887901 | 25780 | Billy Rose Theatre Division |
| 821838 | 4887899 | 25780 | Billy Rose Theatre Division |
| 821836 | 4887897 | 25780 | Billy Rose Theatre Division |
| 821834 | 4887895 | 25780 | Billy Rose Theatre Division |
| 821830 | 4887891 | 25780 | Billy Rose Theatre Division |
| 821829 | 4887890 | 25780 | Billy Rose Theatre Division |
| 821828 | 4887888 | 25780 | Billy Rose Theatre Division |
| 821827 | 4887887 | 25780 | Billy Rose Theatre Division |


####look at failing scores within a collection<a id="t2_scoresincoll"></a>

In [23]:
# look at failing title_2 scores within a collection
ft2 = df[df['title_2'] == 0.0][['mms_id', 'coll_id', 'division']]

# 25792 is the Friedman-Abeles photograph collection
ft2_25792 = ft2[ft2['coll_id'] == 25792].sort_values('mms_id')

# use .count() for report stats; remove .count() for data
ft2_25792.count()

mms_id      11980
coll_id     11980
division    11980
dtype: int64

In [24]:
# look at failing title_2 scores within a collection
ft2 = df[df['title_2'] == 0.0][['mms_id', 'coll_id', 'division']]

# 25790 is the Martha Swope photographs collection
ft2_25790 = ft2[ft2['coll_id'] == 25790].sort_values('mms_id')

# use .count() for report stats; remove .count() for data
ft2_25790.count()

mms_id      832
coll_id     832
division    832
dtype: int64
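
The two cells above count failures for one collection at a time. To rank every collection at once, the failing records can be grouped by `coll_id`. This is a sketch on made-up data; the `toy` frame and its ids are hypothetical.

```python
import pandas as pd

# hypothetical records; ids and scores are invented for illustration
toy = pd.DataFrame({
    'mms_id':  [1, 2, 3, 4, 5, 6],
    'coll_id': [10.0, 10.0, 10.0, 20.0, 20.0, 30.0],
    'title_2': [0, 0, 1, 0, 1, 1],
})

# keep only records failing title_2
failing = toy[toy['title_2'] == 0]

# count failures per collection, largest first
fails_by_coll = failing.groupby('coll_id').size().sort_values(ascending=False)
print(fails_by_coll)
```

Note that `groupby` drops rows whose key is missing by default, so records without a `coll_id` would not appear in this ranking.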

###write results to csv

In [None]:
# count records per collection at each value of the title score, then write to csv
titles_by_coll = pd.crosstab(df['coll_id'], df['title'])
titles_by_coll.to_csv('titles_by_coll.csv')

### co-occurrences<a id="co"></a>

In [4]:
# note: this filter selects records where BOTH title and identifier are absent
print('Records with both title element and identifier absent:')
ids_notitles = df[(df.identifier == 0.0) & (df.title == 0.0)]
print(len(ids_notitles))

print('Records with title element absent but shelf locator present:')
notitle_loc3 = df[(df.title == 0.0) & (df.location_3 == 1.0)]
print(len(notitle_loc3))

Records with both title element and identifier absent:
0
Records with title element absent but shelf locator present:
2
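
Each count above tests a single combination of flags. A cross-tabulation shows every combination in one table, which makes it harder to mix up "absent" and "present" in a filter. This is a sketch on invented data; the `toy` frame and the `'absent'`/`'present'` labels are hypothetical.

```python
import pandas as pd

# hypothetical scores; values are invented for illustration
toy = pd.DataFrame({
    'title':      [0.0, 0.0, 1.0, 1.0, 1.0],
    'identifier': [0, 1, 1, 0, 1],
})

# label each record by whether its title score is 0
title_state = toy['title'].eq(0.0).map({True: 'absent', False: 'present'})

# rows: title absent/present; columns: identifier score (0 or 1)
co = pd.crosstab(title_state, toy['identifier'],
                 rownames=['title'], colnames=['identifier'])
print(co)
```

Reading `co.loc['absent', 1]` gives the number of records with no title but an identifier, and `co.loc['absent', 0]` gives those missing both.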
