### Transcript ingestion and standardisation

Bulk-ingest every PDF transcript file found in one location (e.g. Batch 1) and write each transcript out as a standardised TSV.

NB: This assumes all transcripts follow the most common file layout:

   - Front page containing copyright information etc.
       - This page is omitted in at least one file; a detection step still needs to be added
   - Metadata (interviewee name and regiment, date and name of transcriber)
   - Tables (from MS Word) with timestamps, text (bold indicates the interviewer speaking) and highlighted sections marking film breaks
   - Footers (Legasee information)
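The missing detection step could be sketched as a text heuristic. This is only a sketch under stated assumptions: the helper names, the timestamp pattern and the "copyright" keyword are all hypothetical and have not been checked against the real transcripts. Because the footers may also mention Legasee on every page, the heuristic additionally requires that the page contains no timestamps.

```python
import re

# Assumption: transcript pages contain timestamps such as 00:01:23 or 01:23,
# while the copyright front page contains none.
TIMESTAMP_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")


def looks_like_front_page(page_text: str) -> bool:
    """Guess whether a page is the copyright front page (heuristic only)."""
    text = page_text.lower()
    mentions_copyright = "copyright" in text or "©" in text
    has_timestamps = TIMESTAMP_RE.search(text) is not None
    return mentions_copyright and not has_timestamps


def first_transcript_page(page_texts) -> int:
    """Return 1 if the first page looks like a front page, otherwise 0."""
    if page_texts and looks_like_front_page(page_texts[0]):
        return 1
    return 0
```

The page text itself would come from PyMuPDF, e.g. `page.get_text()` on the first page of the opened document.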

In [1]:
from os import path, listdir
import pandas as pd
import fitz

In [2]:
%cd '/home/tompickard/MiniProject/Legasee-Oral-History/'

from transcript_ingestion import page_to_ts, fancy_page_to_ts, text_to_ts

/home/tompickard/MiniProject/Legasee-Oral-History


In [3]:
# Set target folder for inputs
IN_FOLDER = '~/H_Drive/srv/studat/cdt/data/legasee/navy_veteran_transcripts'

IN_FOLDER = path.expanduser(IN_FOLDER)

In [4]:
# Set location for metadata, outputs
TARGET_FOLDER = '~/H_Drive/srv/studat/cdt/team2/data/legasee'

TARGET_FOLDER = path.expanduser(TARGET_FOLDER)

Check for existence of input and output folders (and interrupt if missing)

In [5]:
assert path.exists(IN_FOLDER)

In [6]:
assert path.exists(TARGET_FOLDER+'/metadata')
assert path.exists(TARGET_FOLDER+'/test')
assert path.exists(TARGET_FOLDER+'/train')
# Unallocated folder for transcripts of interviews for which we do not (yet) have audio
assert path.exists(TARGET_FOLDER+'/unallocated')
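A bare `assert` stops at the first missing folder and does not say which one it was. As a minimal sketch, a hypothetical helper (`missing_folders` is not part of the project code) could collect every missing subfolder so they can be reported together:

```python
from os import path


def missing_folders(root, subfolders):
    """Return the subfolders of `root` that do not exist as directories."""
    return [sub for sub in subfolders if not path.isdir(path.join(root, sub))]


# Possible usage in the notebook (names taken from the cells above):
# missing = missing_folders(TARGET_FOLDER, ['metadata', 'test', 'train', 'unallocated'])
# assert not missing, 'Missing output folders: {}'.format(missing)
```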

Read metadata

In [7]:
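from ast import literal_eval

# literal_eval parses the stringified lists without executing arbitrary code (unlike eval)
meta_df = pd.read_csv(TARGET_FOLDER+'/metadata/'+'master_metadata.csv', converters={'Priority Words': literal_eval, 'Name Words' : literal_eval})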
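The list-valued columns (`Priority Words`, `Name Words`) appear to be stored as stringified Python lists. A more defensive converter might look like the sketch below. It assumes that a blank or malformed cell should become an empty list rather than raise an error; the helper name `parse_list_cell` is hypothetical.

```python
from ast import literal_eval


def parse_list_cell(cell):
    """Parse a stringified list such as "['Alex', 'Owens']" without executing code."""
    # Blank or non-string cells become an empty list (an assumption, see above)
    if not isinstance(cell, str) or not cell.strip():
        return []
    try:
        value = literal_eval(cell)
    except (ValueError, SyntaxError):
        # Not a valid Python literal
        return []
    return list(value) if isinstance(value, (list, tuple)) else []
```

It could then be passed to `pd.read_csv` as `converters={'Priority Words': parse_list_cell, 'Name Words': parse_list_cell}`.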

In [8]:
# Cut down to this specific batch
BATCH_NO = 1

meta_df = meta_df[meta_df.Batch == BATCH_NO]

In [9]:
meta_df

Unnamed: 0,id,Title,Batch,Transcript,Allocation,Content,biografy,vimeo_promo_id,vimeo_description,Service Types,Project Types,Tags,related_videos,Priority Words,Name Words
0,2643.0,Admiral William O'Brien,1,1,Train/Eval,,"A remarkable interview, full of detail and opi...",116177529.0,"On PQ 17: ""I can remember vividly the scene at...",Navy,The Veterans' video archive|Keeping Britain Af...,Places>North Africa|Role>First Lieutenant|Role...,"a:14:{i:0;s:4:""2647"";i:1;s:4:""2648"";i:2;s:4:""2...","[North Africa, First Lieutenant, Navigator, WW...","[Admiral, William, O'Brien]"
1,2071.0,Alex Owens,1,1,Test,,The delightful Alex Owens provides a classic s...,90101380.0,,Navy,Keeping Britain Afloat|The Veterans' video arc...,Places>England - HMS Ganges (Stone Frigate)|Mi...,"a:7:{i:0;s:4:""2074"";i:1;s:4:""2075"";i:2;s:4:""20...","[England, HMS Ganges (Stone Frigate), Descript...","[Alex, Owens]"
2,2301.0,Cornelius Snelling,1,1,Train/Eval,,Cornelius Snelling served on the Black Swan-cl...,93458542.0,Cornelius Snelling served on the Black Swan-cl...,Navy,The Normandy Campaign|The Veterans' video arch...,Miscellaneous>Naval - Rum Ration|Battles>Opera...,"a:12:{i:0;s:4:""2304"";i:1;s:4:""2305"";i:2;s:4:""2...","[Naval, Rum Ration, Operation, Neptune / Overl...","[Cornelius, Snelling]"
3,3238.0,David Craig,1,1,Train/Eval,,David's interview isn't the easiest to listen ...,121183947.0,David Craig provides a fantastic account of hi...,Civilian,The Veterans' video archive|Keeping Britain Af...,Places>Russia|Miscellaneous>Naval Convoy - JW5...,"a:8:{i:0;s:4:""3241"";i:1;s:4:""3242"";i:2;s:4:""32...","[Russia, Naval Convoy, JW53, Ship, Russian / A...","[David, Craig]"
4,2146.0,Dennis Whitehead,1,1,Train/Eval,,Dennis Whitehead served on the C-Class Destroy...,90098682.0,It’s hard to determine if it was good or bad l...,Navy,Keeping Britain Afloat|The Veterans' video arc...,Service Type>Navy|Role>Ordinary Seaman|Places>...,"a:6:{i:0;s:4:""2149"";i:1;s:4:""2150"";i:2;s:4:""21...","[Navy, Ordinary Seaman, Mil Camp Uk, RNB Chath...","[Dennis, Whitehead]"
5,2252.0,Dick West,1,1,Train/Eval,,Dick West gives a brilliant account of his lif...,93459693.0,Dick West gives a brilliant account of his lif...,Navy,Keeping Britain Afloat|The Veterans' video arc...,Miscellaneous>Naval - Action Stations|Miscella...,"a:7:{i:0;s:4:""2255"";i:1;s:4:""2256"";i:2;s:4:""22...","[Naval, Action Stations, Naval Actions, Naval ...","[Dick, West]"
6,2242.0,Doug Shelley,1,1,Train/Eval,,Doug is a proud Chatham Rating who experiences...,93459244.0,Doug was a man with many friends in the Royal ...,Navy,The Veterans' video archive|Keeping Britain Af...,Places>Australia|Role>Chatham Rating|Role>Able...,"a:7:{i:0;s:4:""2245"";i:1;s:4:""2246"";i:2;s:4:""22...","[Australia, Chatham Rating, Able Seaman, Naval...","[Doug, Shelley]"
7,2511.0,Eric Conway,1,1,Train/Eval,,Eric Conway provides a fantastic interview det...,100918175.0,Second World War Submariners are a rare find. ...,Navy,Keeping Britain Afloat|The Veterans' video arc...,Miscellaneous>Description - Job Role|Vehicles>...,"a:9:{i:0;s:4:""2514"";i:1;s:4:""2515"";i:2;s:4:""25...","[Description, Job Role, Submarine, Incident, E...","[Eric, Conway]"
8,2623.0,Gladys Yates,1,1,Train/Eval,,We met Gladys when she visited the Luton prima...,109471660.0,"In this extract from her interview, Gladys rec...",Navy,The Veterans' video archive|Keeping Britain Af...,Service Type>Womens Royal Naval Service|Role>W...,"a:4:{i:0;s:4:""2626"";i:1;s:4:""2627"";i:2;s:4:""26...","[Womens Royal Naval Service, WRNS, Officer Ste...","[Gladys, Yates]"
9,2573.0,Gordon Hooton,1,1,Test,,Gordon ran away from home and the Navy and the...,,,Navy,Keeping Britain Afloat|The Veterans' video arc...,Places>Russia|Places>Russia - Polyarny|Places>...,"a:5:{i:0;s:4:""2576"";i:1;s:4:""2577"";i:2;s:4:""25...","[Russia, Russia, Polyarny, The Far East, Russi...","[Gordon, Hooton]"


In [10]:
exceptions = {}

# Do not skip the first page of PDF for files in this list - use when copyright frontpage is omitted
INCLUDE_FIRST = ['Catherine_Avent.pdf']

# Process files in IN_FOLDER
for fname in listdir(IN_FOLDER):
    _alloc = None
    
    #  Only want PDFs
    if not fname.endswith('.pdf'):
        pass
    
    else:
        #  Extract name
        pname = fname[:-4].replace('_',' ')
        
        #  Check against metadata
        mdets = meta_df[meta_df.Title == pname]
        #  If no match or more than 1, add to exceptions and report at the end of loop
        if len(mdets) != 1:
            print('{} metadata records found corresponding to {}. Adding to exceptions.'.format(len(mdets),fname))
            exceptions[fname] = 'Metadata'
            
        else:
            _alloc = mdets.reset_index().Allocation[0]
            _batch = mdets.reset_index().Batch[0]
            
            _doc = fitz.open(IN_FOLDER+'/'+fname)
            _transcripts = []
            
            # Omit first page as it's copyright material / frontispiece, unless instructed otherwise
            _start_page = 0 if fname in INCLUDE_FIRST else 1
            
            for page in _doc.pages(_start_page):
                try:
                    _transcripts.extend(page_to_ts(page))
                                       
                #  If the page-reading function raises an error, record the file in exceptions and stop reading it
                except Exception as err:
                    print('Error processing {} ({}). Adding to exceptions.'.format(fname, err))
                    exceptions[fname] = 'Processing error'
                    break
            
            if fname in exceptions:
                pass
            
            else:
                df = pd.DataFrame(_transcripts,columns = ["Timestamp", "Speaker", "Transcript"])

                # Determine output subfolder
                if _batch < 0: _subfolder = '/unallocated/'
                elif _alloc == "Test": _subfolder = '/test/'
                elif _alloc == 'Train/Eval' : _subfolder = '/train/'
                else:
                    print('Cannot determine output location for {}. Adding to exceptions.'.format(fname))
                    exceptions[fname] = 'Output location'
                    # Skip this file only; `break` here would abandon all remaining files
                    continue

                _outname = pname.replace(' ','_')

                df.to_csv(TARGET_FOLDER+_subfolder+'transcripts/'+_outname+'.tsv',
                          sep = '\t'
                         )

0 metadata records found corresponding to John Roche.pdf. Adding to exceptions.
0 metadata records found corresponding to Alan Lloyd.pdf. Adding to exceptions.
0 metadata records found corresponding to Mervyn Salter.pdf. Adding to exceptions.
0 metadata records found corresponding to Albert Malin.pdf. Adding to exceptions.
0 metadata records found corresponding to Pam Torrens.pdf. Adding to exceptions.
0 metadata records found corresponding to Alec Penstone.pdf. Adding to exceptions.
0 metadata records found corresponding to Alec Pulfer.pdf. Adding to exceptions.
0 metadata records found corresponding to Alexander Owens.pdf. Adding to exceptions.
0 metadata records found corresponding to Buster Brown.pdf. Adding to exceptions.
0 metadata records found corresponding to Catherine_Avent.pdf. Adding to exceptions.
0 metadata records found corresponding to Ted Hunt.pdf. Adding to exceptions.
0 metadata records found corresponding to Colette Cook.pdf. Adding to exceptions.
0 metadata records found corresponding to Ted Rogers.pdf. Adding to exceptions.
0 metadata records found corresponding to Lord Alan West.pdf. Adding to exceptions.
0 metadata records found corresponding to William O'Brien 1.pdf. Adding to exceptions.
0 metadata records found corresponding to William O'Brien.pdf. Adding to exceptions.
0 metadata records found corresponding to William Sheppard.pdf. Adding to exceptions.
0 metadata records found corresponding to Ernest Kellaway.pdf. Adding to exceptions.
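
The output-folder rules used in the loop can also be written as a small pure function, which makes them easy to test in isolation. This is a sketch only: `choose_subfolder` is a hypothetical name, and the rules are copied from the loop above rather than taken from any separate specification.

```python
def choose_subfolder(batch, alloc):
    """Mirror the loop's routing rules; return None when no rule applies."""
    if batch < 0:
        # Negative batch numbers mark transcripts without (yet) matching audio
        return '/unallocated/'
    if alloc == 'Test':
        return '/test/'
    if alloc == 'Train/Eval':
        return '/train/'
    return None
```

Returning `None` lets the caller record the file under 'Output location' in `exceptions` and move on to the next file.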


In [17]:
for k,v in exceptions.items():
    print(k, '\t', v)

John Roche.pdf 	 Metadata
Alan Lloyd.pdf 	 Metadata
Mervyn Salter.pdf 	 Metadata
Albert Malin.pdf 	 Metadata
Pam Torrens.pdf 	 Metadata
Alec Penstone.pdf 	 Metadata
Alec Pulfer.pdf 	 Metadata
Alexander Owens.pdf 	 Metadata
Buster Brown.pdf 	 Metadata
Catherine_Avent.pdf 	 Metadata
Ted Hunt.pdf 	 Metadata
Colette Cook.pdf 	 Metadata
Ted Rogers.pdf 	 Metadata
Lord Alan West.pdf 	 Metadata
William O'Brien 1.pdf 	 Metadata
William O'Brien.pdf 	 Metadata
William Sheppard.pdf 	 Metadata
Ernest Kellaway.pdf 	 Metadata
Frances McLaren.pdf 	 Metadata
Harry Card.pdf 	 Metadata
John Harrison.pdf 	 Metadata
Joy Aylard Transcript.pdf 	 Metadata
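
Every file listed above failed the exact `Title` match. Some may simply belong to other batches, while others may differ only slightly from a metadata title. One possible way to triage them is to suggest near matches with the standard-library `difflib`, as in the sketch below. The helper name `suggest_titles` and the 0.6 cutoff are assumptions, and a close match is only a hint that still needs a manual check.

```python
from difflib import get_close_matches


def suggest_titles(fname, titles, n=3, cutoff=0.6):
    """Suggest metadata titles that closely resemble a transcript file name."""
    # Same normalisation as the ingestion loop: drop '.pdf', underscores to spaces
    stem = fname[:-4] if fname.endswith('.pdf') else fname
    pname = stem.replace('_', ' ')
    return get_close_matches(pname, titles, n=n, cutoff=cutoff)


# Possible usage in the notebook:
# for fname in exceptions:
#     print(fname, '\t', suggest_titles(fname, list(meta_df.Title)))
```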
