# Malcolm Fraser Radio Talks 

This collection is available for download from the [Malcolm Fraser collection at the University of Melbourne Archives](https://archives.unimelb.edu.au/explore/collections/malcolmfraser/explore/radiotalks); a link to download the data in .txt format is at the bottom of that page. The data is distributed
as a zip file containing multiple text files, one for each speech. This notebook demonstrates how to read this
data into Python and apply a Named Entity Recognition (NER) pipeline to it.

First we load the required modules.

In [None]:
import os
import spacy
import csv
import geocoder
import pandas as pd
import zipfile
from urllib.request import urlopen
import utils
%load_ext autoreload
%autoreload 2

In [None]:
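# Check that a spaCy model loads; 'en_core_web_sm' is an assumed model name,
# substitute the name of the model you have installed.
nlp = spacy.load('en_core_web_sm')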

The data will be stored in the data directory. The following code downloads the zip file and unpacks
it into that directory. Once you have run it, you can browse the [data](data/UMA_Fraser_Radio_Talks) directory to see the files. This is a good example of handling data in a notebook that you want to share: we can't re-publish
the data, but we can provide a link and the code to download and prepare the data for analysis.

In [None]:
dataurl = "https://archives.unimelb.edu.au/__data/assets/text_file/0006/1717746/UMA_Fraser_Radio_Talks.zip"
datafile = 'data/UMA_Fraser_Radio_Talks.zip'
os.makedirs('data', exist_ok=True)  # make sure the target directory exists before writing
with urlopen(dataurl) as response:
    data = response.read()
    with open(datafile, 'wb') as out:
        out.write(data)

with zipfile.ZipFile(datafile, 'r') as zip_ref:
    zip_ref.extractall('data')
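
When a notebook like this is re-run, it can be convenient to skip the download if the zip file is already on disk. The following is a minimal sketch of that idea; the helper name `fetch_if_missing` is hypothetical and is not part of this notebook or of any library.

```python
import os
from urllib.request import urlopen


def fetch_if_missing(url, dest):
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if the file was already there.
    (Hypothetical helper, shown for illustration only.)
    """
    if os.path.exists(dest):
        return False
    # create the parent directory if needed
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    with urlopen(url) as response, open(dest, "wb") as out:
        out.write(response.read())
    return True
```

With this in place, `fetch_if_missing(dataurl, datafile)` could stand in for the download step above, under the assumption that a file already on disk is complete.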

Since we want to work on the files as a collection, we need to get a list of the files to process. The Python `os` module provides `listdir`, which returns a list of the names of the entries in a given directory.

In [None]:
datadir = 'data/UMA_Fraser_Radio_Talks/'
files = os.listdir(datadir)
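
`os.listdir` returns every entry in the directory in arbitrary order. If the directory might contain anything other than the speech files, a small filter can help. This is a sketch only; the helper name `list_text_files` is hypothetical, and it assumes the speech files all end in `.txt`.

```python
import os


def list_text_files(directory):
    """Return the sorted names of the .txt files directly inside directory."""
    return sorted(
        name
        for name in os.listdir(directory)
        # keep regular files only, matching the extension case-insensitively
        if name.lower().endswith(".txt")
        and os.path.isfile(os.path.join(directory, name))
    )
```

Sorting also makes the order of the list, and so the choice of "first file", repeatable between runs.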

To begin working with this data we need to take a look at it.  I can open individual files in a text
editor but I can also take a look in this notebook to see what the contents are.  This next cell reads
the text in the first file in our list and prints it.  From this we can see that there is a metadata section
at the start of each file with a few fields that might be of interest.  

In [None]:
samplefile = files[0]
with open(os.path.join(datadir, samplefile), encoding='utf-8', errors='ignore') as fd:
    text = fd.read()
print(text[:500])

Given this structure we can write a function to read a single file from this collection and strip off
the metadata section.   We will have the code parse the metadata into fields and values.   The result of this
function will be a dictionary representing the file with fields for the metadata and for the text.  We've
also added a field containing the filename as an identifier for each text.  

This function is very specific to this file format but similar code could be used for other formats.  

In [None]:
def read_fraser_text(filename):
    """Read a file from the Malcolm Fraser collection.
    Return a dictionary with fields for the metadata and for the text of the file."""

    # define the initial dictionary
    meta = {
            'text': "",
            'filename': filename
           }
    # now read the file using a utf-8 encoding and ignore any errors (usually weird characters)
    with open(os.path.join(datadir, filename), encoding='utf-8', errors='ignore') as fd:

        inheader = True  # flag that is True until we finish reading the header lines
        for line in fd:
            if inheader:
                if line.startswith("<!--end metadata-->"):
                    # end of the header
                    inheader = False
                elif not line.startswith('<!') and ':' in line:
                    # a "field: value" line; split on the first colon only so that
                    # values containing colons are kept intact, and skip blank lines
                    field, _, value = line.partition(':')
                    meta[field.strip()] = value.strip()
            else:
                # add this line to the text, note that we strip off newlines and 
                # add a space to the line, this cleans it a bit for spacy
                meta['text'] += line.strip() + " "
    
    return meta
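
To see the header parsing in isolation, here is a small sketch that applies the same logic to an in-memory sample. The sample text is made up for illustration: only the `<!--end metadata-->` marker is taken from the code above, while the opening marker, the field names and the values are assumptions about what a file might look like.

```python
def parse_header(lines):
    """Split lines into a metadata dictionary and the body text."""
    meta, body = {}, []
    inheader = True
    for line in lines:
        if inheader:
            if line.startswith("<!--end metadata-->"):
                inheader = False
            elif not line.startswith("<!") and ":" in line:
                # split on the first colon only
                field, _, value = line.partition(":")
                meta[field.strip()] = value.strip()
        else:
            body.append(line.strip())
    return meta, " ".join(body)


# Made-up sample; real files in the collection may differ.
sample = """<!--start metadata-->
Title: Example talk
Note: recorded at 10:30
<!--end metadata-->
Good evening.
This is the body.
""".splitlines()

meta, text = parse_header(sample)
print(meta)
print(text)
```

Because the split happens on the first colon only, the value `recorded at 10:30` survives intact.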


We can now apply this function to all of the filenames and collect the results in a list, then convert that to
a Pandas dataframe for later processing.

In [None]:
data = [read_fraser_text(fn) for fn in files]
fraser = pd.DataFrame(data)
fraser.head()

## Named Entity Recognition

Now that we have the data in a standard form, we can apply the NER process to the text. The utility function
`utils.apply_ner` takes the dataframe we created (`fraser`), the name of the column containing the text and the name
of the column containing the identifier. The result is a new dataframe containing the entities recognised in the text.

In [None]:
entities = utils.apply_ner(fraser, textcol='text', ident='filename')
entities.head()
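
The source of `utils.apply_ner` is not shown in this notebook. As a rough sketch of what such a helper might do (not its actual implementation), the function below runs an NER pipeline over each record and collects one row per entity mention. The name `extract_entities` and the `nlp` parameter are assumptions; `nlp` can be a loaded spaCy pipeline, whose documents expose `ents`, with `text` and `label_` on each entity.

```python
def extract_entities(records, nlp, textcol="text", ident="filename"):
    """Return one dictionary per entity mention found in the records.

    records: an iterable of dictionaries, one per document
    nlp: a callable returning an object with an `ents` sequence, where each
         entity has `text` and `label_` attributes (as a spaCy pipeline does)
    """
    rows = []
    for record in records:
        doc = nlp(record[textcol])
        for ent in doc.ents:
            rows.append({
                ident: record[ident],   # which document the mention came from
                "entity": ent.text,     # the matched text
                "type": ent.label_,     # the entity label, e.g. GPE or PERSON
            })
    return rows
```

Under these assumptions, `pd.DataFrame(extract_entities(fraser.to_dict('records'), nlp))` would give a dataframe with `filename`, `entity` and `type` columns, the same column names used in the cells that follow.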

Given these entities we can select just the `GPE` entities, which are the names of places. We look at the shape of
this dataframe to see how many of these there are. The next cell then generates a table that counts the
number of occurrences of each place name in the text and shows the 30 most frequent.

As would be expected, Fraser talks a lot about Australia and the States.  The US and the Commonwealth are the
most common international mentions. 

In [None]:
locations = entities[entities.type=="GPE"]
print(locations.shape)
locations.entity.value_counts()[:30]

We can do the same exercise for the names of people in the text to give an indication of who Fraser was talking about.  Note the errors creeping in here with Canberra and Viet Cong recognised as person names.  

In [None]:
people = entities[entities.type=="PERSON"]
print(people.shape)
people.entity.value_counts()[:20]
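
One simple way to deal with misclassifications like these is to exclude a hand-made list of known false positives before counting. The sketch below shows the idea on a small made-up dataframe; the sample rows and the `not_people` set are illustrative assumptions, not results from the collection.

```python
import pandas as pd

# Made-up stand-in for the `entities` dataframe
entities = pd.DataFrame({
    "filename": ["a.txt", "a.txt", "b.txt", "b.txt"],
    "entity": ["Canberra", "Whitlam", "Viet Cong", "Whitlam"],
    "type": ["PERSON", "PERSON", "PERSON", "PERSON"],
})

# Names we have decided by inspection are not people
not_people = {"Canberra", "Viet Cong"}

# keep PERSON rows whose entity text is not in the exclusion set
people = entities[(entities["type"] == "PERSON")
                  & ~entities["entity"].isin(not_people)]
counts = people["entity"].value_counts()
print(counts)
```

A list like this has to be maintained by hand, so it suits a quick clean-up of the most frequent errors more than a general fix.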

## Summary

This notebook has illustrated the process of reading a collection of text documents into Python 
and running a Named Entity Recognition process over the texts.  A similar workflow would be
applicable to any collection of texts. In this case there was metadata inside each text document;
that might not be the case in general, which would make the process a little simpler.

The result of the NER process is a collection of entity mentions. This can be further processed
in a number of ways, as illustrated in other notebooks in this series.