# 📦 genre

[genre_1](#genre_1)
- [get overall pass/fail scores](#g1_all)
- [get pass/fail scores by division](#g1_scoresbydiv)
- [look at failing scores by collection](#g1_scoresbycoll)
- [look at failing scores within a collection](#g1_scoresincoll)

[genre_2](#genre_2)
- [get overall pass/fail scores](#g2_all)
- [get pass/fail scores by division](#g2_scoresbydiv)
- [look at failing scores by collection](#g2_scoresbycoll)
- [look at failing scores within a collection](#g2_scoresincoll)

[co-occurrences](#co)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('min-mandatory-score_2016-07-08.csv')

In [2]:
# sanity check: list columns, dtypes, and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 965066 entries, 0 to 965065
Data columns (total 25 columns):
uuid              965066 non-null object
mms_id            965066 non-null int64
mms_type          965066 non-null object
coll_id           962212 non-null float64
division          965066 non-null object
title_1           965066 non-null int64
title_2           965066 non-null int64
title             965066 non-null float64
typeOfResource    965066 non-null int64
genre_1           965066 non-null int64
genre_2           965066 non-null int64
genre             965066 non-null float64
date_1            965066 non-null int64
date_2            965066 non-null int64
date_3            965066 non-null int64
date_4            965066 non-null int64
date              965066 non-null float64
identifier        965066 non-null int64
location_1        965066 non-null int64
location_2        965066 non-null int64
location_3        965066 non-null int64
location_4        965066 non-null int64

In [3]:
# check that basic completeness scores look OK
df[['title_1','typeOfResource', 'identifier', 'date_1', 'genre_1', 'location_1']].describe()

| | title_1 | typeOfResource | identifier | date_1 | genre_1 | location_1 |
|---|---|---|---|---|---|---|
| count | 965066.0 | 965066.0 | 965066.0 | 965066.0 | 965066.0 | 965066.0 |
| mean | 0.999995 | 0.981278 | 0.592315 | 0.598546 | 0.753671 | 0.917343 |
| std | 0.002276 | 0.135542 | 0.491404 | 0.490193 | 0.430873 | 0.275362 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 50% | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 75% | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
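
Because each score column holds only 0 or 1, the `mean` row above can be read as the share of records that passed (for example, about 75.4% for `genre_1`). A minimal sketch of that reading, using a made-up `toy` frame rather than the real CSV:

```python
import pandas as pd

# hypothetical toy scores, not taken from the CSV: 1 = pass, 0 = fail
toy = pd.DataFrame({
    'genre_1':    [1, 1, 1, 0],
    'identifier': [1, 0, 1, 0],
})

# for a 0/1 column, the mean is the fraction of rows that passed
pass_rate = toy.mean()
fail_rate = 1 - pass_rate
print(pass_rate)
```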


## genre_1<a id="genre_1"></a>
#### assertion: presence of at least one genre element
- [get overall pass/fail scores](#g1_all)
- [get pass/fail scores by division](#g1_scoresbydiv)
- [look at failing scores by collection](#g1_scoresbycoll)
- [look at failing scores within a collection](#g1_scoresincoll)


#### get overall pass/fail scores<a id="g1_all"></a>

In [4]:
df.genre_1.value_counts()

1    727342
0    237724
dtype: int64
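
The raw counts above correspond to roughly 75.4% passing and 24.6% failing. One way to get counts and shares side by side is `value_counts(normalize=True)`. This is a sketch on a made-up series (`toy_scores` is hypothetical), not the notebook's own output:

```python
import pandas as pd

# hypothetical toy scores: three passing records and one failing record
toy_scores = pd.Series([1, 1, 1, 0], name='genre_1')

counts = toy_scores.value_counts()                # absolute counts per score
shares = toy_scores.value_counts(normalize=True)  # same, as fractions of the total

# line the two up by score value (0 or 1)
summary = pd.DataFrame({'count': counts, 'share': shares})
print(summary)
```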

#### get pass/fail scores by division<a id="g1_scoresbydiv"></a>

In [5]:
# see genre_1 pass/fail by division
genre_1_table = pd.pivot_table(df,index=['division'],columns=['genre_1'], aggfunc={'genre_1':len},fill_value=0)
genre_1_table

| division | genre_1 = 0 | genre_1 = 1 |
|---|---|---|
| Art and Picture Collection | 33263 | 11259 |
| Billy Rose Theatre Division | 7275 | 157149 |
| Carl H. Pforzheimer Collection of Shelley and His Circle | 14525 | 2162 |
| Children's Center at 42nd St | 37 | 4 |
| Dorot Jewish Division | 2148 | 4934 |
| Dorothy and Lewis B. Cullman Center for Scholars & Writers | 0 | 81 |
| General Research Division | 37070 | 14710 |
| George Arents Collection | 807 | 102123 |
| Henry W. and Albert A. Berg Collection of English and American Literature | 3340 | 9715 |
| Irma and Paul Milstein Division of United States History, Local History and Genealogy | 6954 | 43049 |


In [6]:
# look at failing genre_1 scores by division in descending order
# len = total rows and sum = rows that passed, so len - sum = rows that failed
# group by division and count the failing rows in each group
failed_genre_1 = df.groupby('division')['genre_1'].agg(lambda x: len(x) - sum(x))
# sort so the divisions with the most failing records come first
failed_genre_1.sort_values(ascending=False)

division
General Research Division                                       37070
Art and Picture Collection                                      33263
Manuscripts and Archives Division                               27429
The Miriam and Ira D. Wallach Division of Art, Prints and Photographs: Print Collection    17364
Slavic and East European Collections                            16989
Carl H. Pforzheimer Collection of Shelley and His Circle        14525
Music Division                                                  10759
Spencer Collection                                               8939
Schomburg Center for Research in Black Culture, Manuscripts, Archives and Rare Books Division     8517
Billy Rose Theatre Division                                      7275
The Miriam and Ira D. Wallach Division of Art, Prints and Photographs: Art & Architecture Collection     7073
Irma and Paul Milstein Division of United States History, Local History and Genealogy     6954
Rare Book Division        
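
A similar ranking can be reached by filtering to the failing rows first and counting divisions. The sketch below uses a made-up `toy` frame with hypothetical division names. Note one difference from the grouped version above: divisions with no failing records drop out of the result instead of appearing with a count of 0.

```python
import pandas as pd

# hypothetical toy data: division labels A, B, C are placeholders
toy = pd.DataFrame({
    'division': ['A', 'A', 'B', 'B', 'B', 'C'],
    'genre_1':  [0, 1, 0, 0, 1, 1],
})

# keep only failing rows, then count how often each division appears;
# value_counts() sorts in descending order by default
failing_by_division = toy.loc[toy['genre_1'] == 0, 'division'].value_counts()
print(failing_by_division)
```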

#### look at failing scores by collection<a id="g1_scoresbycoll"></a>

In [2]:
# look at failing genre_1 scores by collection
df[df['genre_1'] == 0.0][['mms_id', 'coll_id', 'division']].sort_values('coll_id').head()

| | mms_id | coll_id | division |
|---|---|---|---|
| 1967 | 3017583 | 25781 | Manuscripts and Archives Division |
| 801582 | 4865830 | 25781 | Manuscripts and Archives Division |
| 801581 | 4865829 | 25781 | Manuscripts and Archives Division |
| 801580 | 4865828 | 25781 | Manuscripts and Archives Division |
| 801579 | 4865827 | 25781 | Manuscripts and Archives Division |


In [5]:
# look at genre_1 scores by collection
genres_table = pd.pivot_table(df,index=['coll_id'],columns=['genre_1'],aggfunc={'genre_1':len},fill_value=0)
genres_table.head()

| coll_id | genre_1 = 0 | genre_1 = 1 |
|---|---|---|
| 25776 | 0 | 871 |
| 25779 | 0 | 100951 |
| 25780 | 0 | 80542 |
| 25781 | 15790 | 0 |
| 25782 | 0 | 72689 |
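
Collections such as 25781 above fail on every record. One way to surface those directly is to compute a failure rate per collection. This is a sketch on made-up data (the `toy` frame and its collection ids are hypothetical); it also shows that `groupby` leaves out rows whose `coll_id` is missing, which matters here because some records have no `coll_id`.

```python
import numpy as np
import pandas as pd

# hypothetical toy data; the last row has no collection id
toy = pd.DataFrame({
    'coll_id': [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, np.nan],
    'genre_1': [0, 0, 1, 0, 1, 1, 0],
})

# mean of a 0/1 column is the pass rate, so 1 - mean is the failure rate
fail_rate = 1 - toy.groupby('coll_id')['genre_1'].mean()

# collections in which every record fails genre_1
all_failing = fail_rate[fail_rate == 1].index.tolist()
print(fail_rate)
print(all_failing)
```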


#### look at failing scores within a collection<a id="g1_scoresincoll"></a>

In [8]:
# look at failing genre_1 scores within a collection
fg1 = df[df['genre_1'] == 0.0][['mms_id', 'coll_id', 'division']]

# 25792 is the Friedman-Abeles photograph collection
fg1_25792 = fg1[fg1['coll_id'] == 25792][['mms_id','coll_id', 'division']].sort_values('mms_id')

# use .count() for report stats; remove .count() for content
fg1_25792.count()

mms_id      0
coll_id     0
division    0
dtype: int64

## genre_2<a id='genre_2'></a>
#### assertion: presence of an 'authority' attribute
- [get overall pass/fail scores](#g2_all)
- [get pass/fail scores by division](#g2_scoresbydiv)
- [look at failing scores by collection](#g2_scoresbycoll)
- [look at failing scores within a collection](#g2_scoresincoll)

#### get overall pass/fail scores<a id="g2_all"></a>

In [9]:
df.genre_2.value_counts()

1    704468
0    260598
dtype: int64

#### look at failing scores by division<a id="g2_scoresbydiv"></a>

In [10]:
# get number of records failing genre_2
# records failing genre_2 will include those that failed genre_1, so we need to 
# calculate the difference between records that failed genre_2 and genre_1

# len = total rows and sum = rows that passed, so len - sum = rows that failed
# group by division and count the failing rows for each score
failed_genre2 = df.groupby('division')[['genre_1', 'genre_2']].agg(lambda x: len(x) - sum(x))

# subtract genre_1 failures from genre_2 failures
difference = failed_genre2['genre_2'] - failed_genre2['genre_1']

difference.sum()

22874
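
The total above matches the overall counts: 260598 records fail genre_2 and 237724 fail genre_1, and 260598 - 237724 = 22874. If every record that fails genre_1 also fails genre_2, as the comments assume, the same figure can be counted directly as records that pass genre_1 but fail genre_2. A sketch on made-up data (the `toy` frame is hypothetical) comparing the two routes:

```python
import pandas as pd

# hypothetical toy data in which failing genre_1 always implies failing genre_2
toy = pd.DataFrame({
    'division': ['A', 'A', 'A', 'B', 'B'],
    'genre_1':  [1, 1, 0, 1, 0],
    'genre_2':  [1, 0, 0, 0, 0],
})

# direct route: a genre element is present but has no authority attribute
missing_authority = (toy['genre_1'] == 1) & (toy['genre_2'] == 0)
direct = missing_authority.astype(int).groupby(toy['division']).sum()

# subtraction route: failures of genre_2 minus failures of genre_1
fails = (1 - toy[['genre_1', 'genre_2']]).groupby(toy['division']).sum()
by_subtraction = fails['genre_2'] - fails['genre_1']

print(direct)
print(by_subtraction)
```

The direct route does not depend on that assumption, so a mismatch between the two results would point to records that pass genre_2 while failing genre_1.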

In [11]:
difference.sort_values(ascending=False)

division
The Miriam and Ira D. Wallach Division of Art, Prints and Photographs: Print Collection    8881
General Research Division                                       6288
Art and Picture Collection                                      1943
The Miriam and Ira D. Wallach Division of Art, Prints and Photographs: Photography Collection     863
The Miriam and Ira D. Wallach Division of Art, Prints and Photographs: Art & Architecture Collection     722
Dorot Jewish Division                                            683
Jerome Robbins Dance Division                                    587
Schomburg Center for Research in Black Culture, Photographs and Prints Division     384
Science, Industry and Business Library: General Collection       378
Manuscripts and Archives Division                                333
Henry W. and Albert A. Berg Collection of English and American Literature     332
Billy Rose Theatre Division                                      231
Schomburg Center for Research i

#### look at failing scores by collection<a id="g2_scoresbycoll"></a>

In [9]:
# look at failing genre_2 scores by collection
df[df['genre_2'] == 0.0][['mms_id', 'coll_id', 'division']].sort_values('coll_id').head()

| | mms_id | coll_id | division |
|---|---|---|---|
| 1967 | 3017583 | 25781 | Manuscripts and Archives Division |
| 801375 | 4865623 | 25781 | Manuscripts and Archives Division |
| 45664 | 3087987 | 25781 | Manuscripts and Archives Division |
| 801374 | 4865622 | 25781 | Manuscripts and Archives Division |
| 801373 | 4865621 | 25781 | Manuscripts and Archives Division |


#### look at failing scores within a collection<a id="g2_scoresincoll"></a>

In [13]:
# look at failing genre_2 scores within a collection
fg2 = df[df['genre_2'] == 0.0][['mms_id', 'coll_id', 'division']]

# 25792 is the Friedman-Abeles photograph collection
fg2_25792 = fg2[fg2['coll_id'] == 25792][['mms_id','coll_id', 'division']].sort_values('mms_id')

# use .count() for report stats; remove .count() for content to export to csv
fg2_25792.count()

mms_id      0
coll_id     0
division    0
dtype: int64
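
The comment in the cell above mentions dropping `.count()` to export the matching records to CSV. A minimal sketch of that step on made-up data (the `toy` frame and its values are hypothetical). Calling `to_csv` without a path returns the CSV text; passing a file name of your choosing writes a file instead.

```python
import pandas as pd

# hypothetical toy data: two failing records in collection 25792, one elsewhere
toy = pd.DataFrame({
    'mms_id':   [30, 10, 20],
    'coll_id':  [25792.0, 25792.0, 11111.0],
    'division': ['X', 'X', 'Y'],
    'genre_2':  [0, 0, 0],
})

failing = toy[toy['genre_2'] == 0][['mms_id', 'coll_id', 'division']]
in_coll = failing[failing['coll_id'] == 25792].sort_values('mms_id')

# index=False leaves the DataFrame's row labels out of the export
csv_text = in_coll.to_csv(index=False)
print(csv_text)
```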

## co-occurrences<a id="co"></a>

In [14]:
print('Records with a genre absent and an identifier present:')
nogenre_ids = df[(df.genre_1 == 0.0) & (df.identifier == 1.0)]
print(len(nogenre_ids))

print('Records with a genre present but no resource type:')
genre_notype = df[(df.genre_1 == 1.0) & (df.typeOfResource == 0.0)]
print(len(genre_notype))

print('Records with both a genre and a resource type:')
genre_type = df[(df.genre_1 == 1.0) & (df.typeOfResource == 1.0)]
print(len(genre_type))

Records with a genre absent and an identifier present:
186485
Records with a genre present but no resource type:
4597
Records with both a genre and a resource type:
722745
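
The last two counts add up to the 727342 records that pass genre_1 (4597 + 722745). All four combinations of two scores can be tabulated at once with `pd.crosstab`. This is a sketch on made-up data (the `toy` frame is hypothetical), not output from the real CSV:

```python
import pandas as pd

# hypothetical toy scores for five records
toy = pd.DataFrame({
    'genre_1':        [1, 1, 1, 0, 0],
    'typeOfResource': [1, 1, 0, 1, 0],
})

# rows are genre_1 scores, columns are typeOfResource scores,
# and each cell counts the records with that combination
co = pd.crosstab(toy['genre_1'], toy['typeOfResource'])
print(co)
```

Here `co.loc[1, 0]` is the number of records with a genre but no resource type, and `co.loc[1, 1]` is the number with both.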
