
add Fraser example reading local text files
stevecassidy committed Dec 10, 2018
1 parent f0c6f92 commit b03d6ea
Showing 5 changed files with 378 additions and 59 deletions.
9 changes: 9 additions & 0 deletions Digivol NER.ipynb
@@ -135,6 +135,15 @@
"locations"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"locations.to_csv(\"digivol-locations.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
283 changes: 283 additions & 0 deletions Fraser Radio Talks.ipynb
@@ -0,0 +1,283 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Malcolm Fraser Radio Talks \n",
"\n",
"This collection is available for download from the University of Melbourne: [Malcolm Fraser collection at the University of Melbourne Archives.](https://archives.unimelb.edu.au/explore/collections/malcolmfraser/explore/radiotalks) (at the bottom of that page there is a link to download the data in .txt format). The data is distributed\n",
"as a zip file containing multiple text files, one for each speech. This notebook demonstrates how to read this \n",
"data into Python and apply a Named Entity Recognition pipeline to it.\n",
"\n",
"First we load the required modules."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import spacy\n",
"import csv\n",
"import geocoder\n",
"import pandas as pd\n",
"import zipfile\n",
"from urllib.request import urlopen\n",
"import utils\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = 'en_core_web_md'\n",
"#spacy.cli.download(model)\n",
"nlp = spacy.load(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data will be stored in the data directory, the folowing code downloads the zip file and unpacks \n",
"it into the data directory. Once you run this you can browse the [data](data/UMA_Fraser_Radio_Talks) directory to see the files. This is a good example of handling data in a notebook that you want to share. We can't re-publish\n",
"the data but we can provide a link and the code to download and prepare the data for analysis. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataurl = \"https://archives.unimelb.edu.au/__data/assets/text_file/0006/1717746/UMA_Fraser_Radio_Talks.zip\"\n",
"datafile = 'data/UMA_Fraser_Radio_Talks.zip'\n",
"with urlopen(dataurl) as response:\n",
" data = response.read()\n",
" with open(datafile, 'wb') as out:\n",
" out.write(data)\n",
"\n",
"with zipfile.ZipFile(datafile, 'r') as zip_ref:\n",
" zip_ref.extractall('data')"
]
},
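{
"cell_type": "markdown",
"metadata": {},
"source": [
"Re-running the notebook as it stands will download the zip file every time. A minimal sketch that skips the download when the file is already present (assuming any existing copy is complete):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# only download the zip file if a previous run hasn't already fetched it\n",
"if not os.path.exists(datafile):\n",
"    with urlopen(dataurl) as response:\n",
"        with open(datafile, 'wb') as out:\n",
"            out.write(response.read())"
]
},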
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we want to work on the files as a collection, we need to get a list of the files to process. The python `os` module provides `listdir` which gives us a list of filenames in a given directory. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datadir = 'data/UMA_Fraser_Radio_Talks/'\n",
"files = os.listdir(datadir)"
]
},
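{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `listdir` returns every entry in the directory, including any hidden or system files. As a precaution we can keep just the `.txt` files (assuming, as the download page indicates, that each speech is a `.txt` file):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# keep only the text files, ignoring any stray directory entries\n",
"files = [f for f in files if f.endswith('.txt')]\n",
"len(files)"
]
},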
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To begin working with this data we need to take a look at it. I can open individual files in a text\n",
"editor but I can also take a look in this notebook to see what the contents are. This next cell reads\n",
"the text in the first file in our list and prints it. From this we can see that there is a metadata section\n",
"at the start of each file with a few fields that might be of interest. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"samplefile = files[0]\n",
"with open(os.path.join(datadir, samplefile)) as fd:\n",
" text = fd.read()\n",
"print(text[:500])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given this structure we can write a function to read a single file from this collection and strip off\n",
"the metadata section. We will have the code parse the metadata into fields and values. The result of this\n",
"function will be a dictionary representing the file with fields for the metadata and for the text. We've\n",
"also added a field containing the filename as an identifier for each text. \n",
"\n",
"This function is very specific to this file format but similar code could be used for other formats. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def read_fraser_text(filename):\n",
" \"\"\"Read a file from the Malcolm Fraser collection\n",
" Return a dictionary with metadata and fields for the text of the file\"\"\"\n",
"\n",
" # define the initial dictionary\n",
" meta = {\n",
" 'text': \"\",\n",
" 'filename': filename\n",
" }\n",
" # now read the file using a utf-8 encoding and ignore any errors (usually wierd characters)\n",
" with open(os.path.join(datadir, filename), encoding='utf-8', errors='ignore') as fd:\n",
"\n",
" inheader = True # flag that is True until we finish reading the header lines\n",
" for line in fd.readlines():\n",
" if inheader:\n",
" # if we are in the header, try to extract the metadata from fields that don't start with <!\n",
" if not line.startswith('<!'):\n",
" words = line.split(':')\n",
" meta[words[0]] = \":\".join(words[1:]).strip()\n",
" if line.startswith(\"<!--end metadata-->\"):\n",
" # end of the header\n",
" inheader = False\n",
" else:\n",
" # add this line to the text, note that we strip off newlines and \n",
" # add a space to the line, this cleans it a bit for spacy\n",
" meta['text'] += line.strip() + \" \"\n",
" \n",
" return meta\n"
]
},
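{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before applying the function to the whole collection it is worth checking the result on the sample file we looked at above (the exact metadata keys depend on the header fields of that file):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# parse the sample file and inspect the metadata fields it produced\n",
"meta = read_fraser_text(samplefile)\n",
"print(list(meta.keys()))\n",
"print(meta['text'][:200])"
]
},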
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now apply this function to all of the filenames and collect the result in a list, then convert that to \n",
"a Pandas dataframe for later processing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = [read_fraser_text(fn) for fn in files]\n",
"fraser = pd.DataFrame(data)\n",
"fraser.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Named Entity Recognition\n",
"\n",
"Now that we have the data in a standard form, we can apply the NER process to the text. The utility function\n",
"takes the data frame we created `fraser` and the name of the column containing the text and that containing\n",
"the identifier. The result is a new dataframe containing the entities recognised in the text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"entities = utils.apply_ner(fraser, textcol='text', ident='filename')\n",
"entities.head()"
]
},
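{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `entity` and `type` columns of this dataframe are used below. A quick way to see how the recognised entities are distributed across entity types:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# count how many entities of each type were recognised\n",
"entities.type.value_counts()"
]
},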
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given these entities we can select just the `GPE` entities - the names of places. We look at the shape of\n",
"this dataframe to see how many of these there are. The next cell then generates a table that counts the\n",
"number of occurences of each placename in the text and shows the top 30 places. \n",
"\n",
"As would be expected, Fraser talks a lot about Australia and the States. The US and the Commonwealth are the\n",
"most common international mentions. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"locations = entities[entities.type==\"GPE\"]\n",
"print(locations.shape)\n",
"locations.entity.value_counts()[:30]"
]
},
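{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `geocoder` module imported above could take this a step further and turn placenames into coordinates. A sketch using the OpenStreetMap provider (this needs network access, and the choice of provider is just a suggestion):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# look up coordinates for the ten most frequent placenames via OpenStreetMap\n",
"for place in locations.entity.value_counts()[:10].index:\n",
"    g = geocoder.osm(place)\n",
"    print(place, g.latlng)"
]
},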
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can do the same exercise for the names of people in the text to give an indication of who Fraser was talking about. Note the errors creeping in here with Canberra and Viet Cong recognised as person names. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"people = entities[entities.type==\"PERSON\"]\n",
"print(people.shape)\n",
"people.entity.value_counts()[:20]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"This notebook has illustrated the process of reading a collection of text documents into Python \n",
"and running a Named Entity Recognition process over the texts. A similar workflow would be\n",
"applicable to any collection of texts. In this case there was metadata inside each text document,\n",
"that might not be the case in general making the process a little simpler. \n",
"\n",
"The results of the NER process is a collection of entity mentions. This can be further processed\n",
"in a number of ways, as illustrated in other notebooks in this series."
]
},
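{
"cell_type": "markdown",
"metadata": {},
"source": [
"One simple next step is to save the entity tables as CSV files for use elsewhere, mirroring the pattern added to the Digivol notebook in this commit (the filenames here are just suggestions):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# save the entity tables for use in other notebooks\n",
"locations.to_csv('fraser-locations.csv')\n",
"people.to_csv('fraser-people.csv')"
]
},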
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
