### Analysis of the dataset

This dataset contains the descriptive metadata of the [Moving Image Archive catalogue](https://data.nls.uk/data/metadata-collections/moving-image-archive/), Scotland’s national collection of moving images.

In [6]:
import pandas as pd

#### Loading the CSV data into pandas

In [7]:
path_csv = "data/output/movingImageArchive.csv"
df = pd.read_csv(path_csv, sep=',')

In [8]:
# structure of the data: list the column names
print(df.columns.tolist())

['title', 'author', 'place_publication', 'date', 'extent', 'credits', 'subjects', 'summary', 'details', 'link', 'geographicNames', 'contentType', 'mediaType', 'carrierType', 'generalNote', 'personalName', 'thumbnail']


In [9]:
# number of non-null values per column
print(df.count())

title                20599
author                1443
place_publication    20608
date                 15575
extent               20608
credits              14889
subjects              8006
summary              20587
details              20260
link                 20608
geographicNames       4604
contentType          20608
mediaType            20608
carrierType          20608
generalNote          20608
personalName           770
thumbnail             5345
dtype: int64
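
Because `df.count()` reports non-null values per column, the gaps between columns are easier to compare as a share of missing values. A minimal sketch, assuming a DataFrame shaped like `df`; the `missing_share` helper and the toy `sample` frame are illustrative names that do not appear in the notebook:

```python
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)

# Toy stand-in for df (hypothetical values); in the notebook, call missing_share(df).
sample = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "author": [None, None, None, "x"],
    "geographicNames": ["Glasgow", None, "Edinburgh", None],
})
shares = missing_share(sample)
print(shares)
```

Applied to the real data, this would show at a glance that fields such as `personalName` and `author` are sparsely populated while `link` and `extent` are complete.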


In [10]:
# summary statistics for the geographic names column
print(df['geographicNames'].describe())

count        4604
unique        496
top       Glasgow
freq          729
Name: geographicNames, dtype: object


In [11]:
print(df["geographicNames"].unique())

['Glasgow' 'Edinburgh' 'Dunbartonshire' nan 'Glasgow -- Renfrewshire'
 'Aberdeen' 'Renfrewshire' 'Forth River' 'Glasgow -- Highlands, the'
 'Borders -- Dumfriesshire -- Edinburgh -- Fife -- Glasgow -- Stirling'
 'Dumfriesshire -- Fife -- Glasgow -- Renfrewshire' 'Ayrshire'
 'Lanarkshire' 'Edinburgh -- Glasgow -- Renfrewshire'
 'Dunbartonshire -- Glasgow -- Lanarkshire' 'Dundee' 'Bute'
 'Morayshire -- Perth' 'Borders' 'Perth' 'Highlands, the' 'Sutherland'
 'Angus -- Dundee' 'Aberdeen -- Aberdeenshire' 'Glasgow -- West Lothian'
 'Dumfriesshire' 'Ayrshire -- Dumfriesshire' 'West Lothian' 'Fife'
 'Borders -- Edinburgh -- Glasgow -- Invernesshire' 'Aberdeenshire'
 'Aberdeen -- Borders -- Edinburgh -- Fife -- Forth River -- Glasgow -- Stirling'
 'East Lothian -- Edinburgh -- Forth River -- Glasgow -- Gorbals, the'
 'Glasgow -- Perth' 'Dunbartonshire -- Glasgow' 'Shetland Islands'
 'Invernesshire'
 'Caithness -- Highlands, the -- Invernesshire -- Orkney Islands -- Outer Hebrides -- Ross-shire
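
Many values combine several places separated by ` -- `, so `describe()` and `unique()` count whole strings rather than individual places. A minimal sketch of counting each place on its own, assuming the separator is always ` -- `; `count_places` and the sample Series are illustrative, with values taken from the output above:

```python
import pandas as pd

def count_places(names: pd.Series, sep: str = " -- ") -> pd.Series:
    """Count individual place names in a multi-valued column."""
    # Drop missing rows, split each string, and give every place its own row.
    places = names.dropna().str.split(sep).explode().str.strip()
    return places.value_counts()

# Small sample of values seen in the column; in the notebook,
# call count_places(df["geographicNames"]) instead.
sample_names = pd.Series([
    "Glasgow",
    "Glasgow -- Renfrewshire",
    None,
    "Edinburgh -- Glasgow -- Renfrewshire",
])
place_counts = count_places(sample_names)
print(place_counts)
```

With this approach a record tagged `Glasgow -- Renfrewshire` contributes to both places, so the per-place totals on the full dataset would be higher than the `freq` reported by `describe()`.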

In [12]:
print(df["summary"].head(10))

0    The Botanic Gardens, Glasgow with shots of the...
1    Footage of the last trams to run in Glasgow, a...
2    The story of the last Edinburgh tram.  Shots o...
3    Footage of the last tram to run in Glasgow. Th...
4    Scottish school pupils studying scientific and...
5    Glasgow University celebrates its Fifth Centen...
6    Celebrations in Glasgow attended by students f...
7    Procession of dignitaries in horse-drawn carri...
8    Harry Lauder leaves for Liverpool from London'...
9    A selection of amateur films made in the early...
Name: summary, dtype: object


#### What information is available as place of publication?

The place of publication field holds the same placeholder text in every record. It appears in two variants that differ only in a trailing ` : `.

In [13]:
print(df["place_publication"].unique())

['[Place of production not identified]'
 '[Place of production not identified] : ']
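
If the two variants need to be treated as one value, the trailing separator can be stripped. A minimal sketch, assuming the only difference is trailing spaces and colons; `sample_places` and `normalised` are illustrative names:

```python
import pandas as pd

# The two variants printed above.
sample_places = pd.Series([
    "[Place of production not identified]",
    "[Place of production not identified] : ",
])

# Remove any trailing spaces and colons so both variants collapse to one string.
normalised = sample_places.str.rstrip(" :")
print(normalised.unique())
```

In the notebook the same call would be applied to `df["place_publication"]`.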


#### Let's check the media, carrier and content type metadata fields

In [14]:
print(df["mediaType"].unique())
print(df["carrierType"].unique())
print(df["contentType"].unique())

['unspecified -- rdamedia']
['unspecified  -- rdacarrier']
['two-dimensional moving image -- rdacontent']


#### How many thumbnails are available?

In [16]:
print(df["thumbnail"].describe())

count                                                  5345
unique                                                 5345
top       http://deriv.nls.uk/dcn19/1358/3808/135838082....
freq                                                      1
Name: thumbnail, dtype: object
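
All 5345 thumbnail URLs are unique, so the coverage can be stated as a share of the catalogue. A worked calculation, assuming the 20608 non-null `link` values reported by `df.count()` equal the total number of records (the notebook does not print the row count); the variable names are illustrative:

```python
# Counts taken from the outputs above.
with_thumbnail = 5345
total_records = 20608  # assumption: every record has a link

thumbnail_share = with_thumbnail / total_records
print(f"{thumbnail_share:.1%}")  # prints 25.9%
```

On the loaded data, `df["thumbnail"].notna().mean()` should give the same ratio without relying on that assumption.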


#### Let's check the subjects

In [17]:
# split the ' -- '-separated subjects and collect the unique values
subjects = pd.unique(df['subjects'].str.split(' -- ', expand=True).stack()).tolist()
print("Total unique subjects:" + str(len(subjects)))
for s in sorted(subjects, key=str.lower):
    print(s)

Total unique subjects:91
Agriculture
Air displays and shows
Air Raids
Aircraft see also Helicopters
Airports
Animals
Architecture and Buildings
Art and Artists, general  
Arts and Crafts
Birds
British Empire, the
Broadcasting, general
Buddhism
Bulldozers
Bus Stations and Depots
Buses and Coaches, general
Butchers and Butcher Shops
Cafeterias and Canteens
Camping
Canals
Canoeing
Carriages
Celebrations, Traditions and Customs
Celts and Celtic Culture
Ceremonies
Cheese and Cheese Making
Children and Infants
Christmas  see also New Year
Construction and Engineering
Crime, Punishment and Law Enforcement
Dentistry
Depression, the
Disillusionment
Easter
Education
Emotions, Attitudes and Behaviour
Employment, Industry and Industrial Relations
Environment
Ferries
Fire Service
Fish and Fishing
Fish Gutting
Fish Markets
Fishing Boats
Fishwives
Food and Drink
Forth River
Gaelic
Healthcare
Highland Games
Hogmanay
Holiday Camps
Home Guard
Home Life
Housing and Living Conditions
Institutional Care
La

In [18]:
# Split the subjects and obtain the number of occurrences
subject_count = df['subjects'].str.split(' -- ').apply(lambda x: pd.Series(x).value_counts()).sum().astype('int').sort_values(ascending=False).to_frame().reset_index(level=0)

# Name the columns
subject_count.columns = ['subject', 'count']

# Show bar chart
display(subject_count.style.bar(subset=['count'], color='#d65f5f').set_properties(subset=['count'], **{'width': '300px'}))

| | subject | count |
|---|---|---|
| 0 | Gaelic | 1943 |
| 1 | Leisure and Recreation | 1145 |
| 2 | Transport | 732 |
| 3 | Employment, Industry and Industrial Relations | 730 |
| 4 | Sporting Activities | 625 |
| 5 | Ships and Shipping | 567 |
| 6 | Education | 525 |
| 7 | Celebrations, Traditions and Customs | 497 |
| 8 | Tourism and Travel | 459 |
| 9 | Media, Communication and the Creative Industries | 441 |
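
The row-wise `apply` used for the counts builds one Series per record, which can be slow on about 20,000 rows. An alternative sketch using `explode`, which should give the same counts under the assumption that subjects are always separated by ` -- `; `count_subjects` and the sample Series are illustrative names, not part of the notebook:

```python
import pandas as pd

def count_subjects(subjects: pd.Series, sep: str = " -- ") -> pd.DataFrame:
    """Return a subject/count table sorted by descending count."""
    # One row per individual subject; strip stray spaces around each term.
    terms = subjects.dropna().str.split(sep).explode().str.strip()
    counts = terms.value_counts()
    return counts.rename_axis("subject").reset_index(name="count")

# Toy sample; in the notebook, call count_subjects(df["subjects"]).
sample_subjects = pd.Series([
    "Gaelic",
    "Transport -- Education",
    None,
    "Gaelic -- Transport",
    "Gaelic",
])
subject_table = count_subjects(sample_subjects)
print(subject_table)
```

The `str.strip()` step is a design choice rather than something the notebook does: the subject list above contains entries with trailing spaces, such as `Art and Artists, general`, which would otherwise be counted separately from a trimmed spelling.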


## References

- https://pymarc.readthedocs.io/en/latest/#api-docs
- https://www.loc.gov/marc/bibliographic/