In [1]:
from proquest_xml import ProquestXml, create_dataframe

## Read XML and get basic details

In [2]:
pq_xml = ProquestXml('data/hr_sample/868650799.xml')
print(pq_xml)
print(pq_xml.id, pq_xml.get_article_title(), sep=';')

ProquestXml(id=868650799, title='Million-dollar ...')
868650799;Million-dollar refugee family caught in perpetual detention


## Read multiple documents and create a dataframe

In [3]:
import glob

filenames = glob.glob('data/hr_sample/*.xml')
docs = {doc.id: doc for doc in map(ProquestXml, filenames)}
df = create_dataframe(docs.values())

print(df.shape)
df.head()

(300, 10)


|  | id | title | date_published | publication | author1_last_name | author1_first_name | author1_full_name | other_authors | article_type | text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 879751809 | Army officer claims stutter discrimination - E... | 2011-07-29 | The Australian | Callinan | Rory | RORY CALLINAN | [] | News | A STUTTERING army officer has lodged a discri... |
| 1 | 908016581 | Aboriginal call for intervention on health nig... | 2011-12-06 | The Australian | Puddy | Rebecca | REBECCA PUDDY | [] | News | HUNDREDS of different laws across the country... |
| 2 | 881503793 | Amnesty says policy forcing Aborigines off the... | 2011-08-09 | Sydney Morning Herald | Lester | Tim | Tim Lester | [{'order': '2', 'last_name': 'Willingham', 'fi... | News | GOVERNMENT policies are driving Australian Ab... |
| 3 | 914325855 | There will be women in foxholes | 2012-01-07 | Sydney Morning Herald | Snow | Deborah | Deborah Snow | [] | News | "I was one of the first women to do the job I... |
| 4 | 904162894 | Afghan wins hearing into deportation | 2011-11-17 | Sydney Morning Herald | Needham | Kirsty | Kirsty Needham | [] | News | A CHALLENGE to the first forced deportation o... |


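For documents with more than one author, the additional authors are stored
as a list of dictionaries in the `other_authors` column. Counting the list
lengths shows how many documents have co-authors: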

In [4]:
df['other_authors'].map(len).value_counts()

0    287
1     10
3      2
2      1
Name: other_authors, dtype: int64
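
To work with the co-authors directly, each row's list can be flattened into a plain list of names. The sketch below uses ordinary dictionaries rather than the dataframe so that it runs on its own; the sample rows and names are hypothetical, and it relies only on the `order` and `last_name` keys visible in the output above.

```python
from collections import Counter

# Hypothetical rows mimicking the dataframe columns; the names are made up.
rows = [
    {"id": "1", "author1_last_name": "Smith", "other_authors": []},
    {"id": "2", "author1_last_name": "Jones",
     "other_authors": [{"order": "3", "last_name": "Brown"},
                       {"order": "2", "last_name": "Lee"}]},
]


def all_last_names(row):
    """Return the first author's last name followed by co-authors in order."""
    extras = sorted(row["other_authors"], key=lambda a: int(a["order"]))
    return [row["author1_last_name"]] + [a["last_name"] for a in extras]


# Plain-Python equivalent of tallying the list lengths
extra_author_counts = Counter(len(row["other_authors"]) for row in rows)

for row in rows:
    print(row["id"], all_last_names(row))
```

Sorting on `int(a["order"])` is there because the `order` values appear as strings in the sample output.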

### Adding additional data

To include extra fields in the dataframe, pass `extra_fields`: a dict that maps each new column name to either a path into the nested structure or a function that takes the document and returns the value (an existing method such as `get_terms`, or a custom function):

In [5]:
def get_language(doc):
    """Return the RawLang value of each language entry in the document."""
    lang_info = doc.get('Obj/Language')
    # Keep only the entries that contain a 'RawLang' key
    return [entry['RawLang'] for entry in lang_info if 'RawLang' in entry]

extra_df = create_dataframe(
    # pq_xml is prepended because some docs lack the subject field
    [pq_xml] + list(docs.values()),
    extra_fields={
        'copyright': 'Obj/Copyright/CopyrightData',
        'languages': get_language,
        'terms': ProquestXml.get_terms
    })

extra_df.head()

|  | id | title | date_published | publication | author1_last_name | author1_first_name | author1_full_name | other_authors | article_type | text | copyright | languages | terms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 868650799 | Million-dollar refugee family caught in perpet... | 2011-05-27 | The Australian | Maley | Paul | Paul Maley | [] | News | THEY are Australia's most expensive refugees.... | (c) News Limited Australia. All rights reserved. | [English] | [Immigration policy] |
| 1 | 879751809 | Army officer claims stutter discrimination - E... | 2011-07-29 | The Australian | Callinan | Rory | RORY CALLINAN | [] | News | A STUTTERING army officer has lodged a discri... | (c) News Limited Australia. All rights reserved. | [English] | [Human rights, Discrimination, Armed forces] |
| 2 | 908016581 | Aboriginal call for intervention on health nig... | 2011-12-06 | The Australian | Puddy | Rebecca | REBECCA PUDDY | [] | News | HUNDREDS of different laws across the country... | (c) News Limited Australia. All rights reserved. | [English] | [Native peoples, Intervention] |
| 3 | 881503793 | Amnesty says policy forcing Aborigines off the... | 2011-08-09 | Sydney Morning Herald | Lester | Tim | Tim Lester | [{'order': '2', 'last_name': 'Willingham', 'fi... | News | GOVERNMENT policies are driving Australian Ab... | (Copyright (c) 2011 Fairfax Media Publications... | [English] | [Native peoples, Infant mortality, Housing] |
| 4 | 914325855 | There will be women in foxholes | 2012-01-07 | Sydney Morning Herald | Snow | Deborah | Deborah Snow | [] | News | "I was one of the first women to do the job I... | (Copyright (c) 2012 Fairfax Media Publications... | [English] | [Armed forces, Violence] |
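
The path-or-function convention can be pictured with a small dispatcher. This is only a sketch of the idea, not `create_dataframe`'s actual implementation: `get_path`, `resolve_field` and the sample `doc` are hypothetical, and a plain nested dict stands in for a `ProquestXml` object.

```python
def get_path(data, path, sep="/"):
    """Walk a nested dict along a slash-separated path."""
    node = data
    for key in path.split(sep):
        node = node[key]
    return node


def resolve_field(doc, spec):
    """Resolve one extra field: call spec if it is callable, else treat it as a path."""
    if callable(spec):
        return spec(doc)
    return get_path(doc, spec)


# Hypothetical stand-in for a parsed document
doc = {"Obj": {"Copyright": {"CopyrightData": "(c) Example"},
               "Language": [{"RawLang": "English"}]}}

extra_fields = {
    "copyright": "Obj/Copyright/CopyrightData",
    "languages": lambda d: [e["RawLang"] for e in d["Obj"]["Language"]],
}

# One output row: each column name maps to its resolved value
row = {name: resolve_field(doc, spec) for name, spec in extra_fields.items()}
print(row)
```

Checking `callable(spec)` first means methods, lambdas and ordinary functions are all handled the same way, and anything else is assumed to be a path string.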


## Other features

A `ProquestXml` object supports dictionary-style item lookup:

In [6]:
pq_xml['GOID']

'868650799'

It uses the `dpath` module to give path-based access into the nested dictionaries, plus some basic searching:

In [7]:
pq_xml.get('DFS/PubFrosting/Title')

'The Australian'

In [8]:
pq_xml.search('DFS/PubFrosting/*Title*')

{'DFS': OrderedDict([('PubFrosting',
               OrderedDict([('Title', 'The Australian'),
                            ('SortTitle', 'Australian The'),
                            ('CurrentTitle',
                             OrderedDict([('Title', 'The Australian'),
                                          ('SortTitle', 'Australian The'),
                                          ('Qualifier', 'Canberra, A.C.T.'),
                                          ('EndIssueDate', '99991231'),
                                          ('Locators',
                                           OrderedDict([('Locator',
                                                         [OrderedDict([('@Type',
                                                                        'PQPMID'),
                                                                       ('Name',
                                                                        '55714')]),
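
For readers unfamiliar with glob-style paths, the following standard-library sketch shows the general idea of matching a pattern one path segment at a time. It is not how `dpath` or `ProquestXml.search` is implemented, and unlike the output above it returns flat `(path, value)` pairs rather than a nested dictionary; `glob_search` and the sample `record` are hypothetical.

```python
from fnmatch import fnmatchcase


def glob_search(data, pattern, sep="/"):
    """Return (path, value) pairs whose path matches the glob, segment by segment."""
    parts = pattern.split(sep)
    results = []

    def walk(node, depth, prefix):
        if depth == len(parts):
            # Every pattern segment has been matched
            results.append((sep.join(prefix), node))
            return
        if not isinstance(node, dict):
            return
        for key, child in node.items():
            if fnmatchcase(key, parts[depth]):
                walk(child, depth + 1, prefix + [key])

    walk(data, 0, [])
    return results


# Hypothetical nested data, loosely modelled on the output above
record = {"DFS": {"PubFrosting": {"Title": "The Australian",
                                  "SortTitle": "Australian The",
                                  "SourceType": "Newspapers"}}}

matches = glob_search(record, "DFS/PubFrosting/*Title*")
print(matches)
```

Matching each segment separately keeps `*` from crossing a `/` boundary, which a single `fnmatch` call on the whole path would allow.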
                                                      

Search the whole document for tags whose name contains a string (the match is case-insensitive in the example below):

In [9]:
pq_xml.search_all_tags('title')

['/Obj/TitleAtt',
 '/Obj/TitleAtt/Title',
 '/DFS/PubFrosting/Title',
 '/DFS/PubFrosting/SortTitle',
 '/DFS/PubFrosting/CurrentTitle',
 '/DFS/PubFrosting/CurrentTitle/Title',
 '/DFS/PubFrosting/CurrentTitle/SortTitle']
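
A search like this can be approximated with a short recursive walk over the nested dictionaries. The sketch below assumes a case-insensitive substring match on tag names, which is consistent with the output above but not confirmed by it; `find_tags` and the sample `record` are hypothetical and not part of `proquest_xml`.

```python
def find_tags(data, needle, prefix=""):
    """Yield the path of every key whose name contains `needle`, ignoring case."""
    if isinstance(data, dict):
        for key, child in data.items():
            path = f"{prefix}/{key}"
            if needle.lower() in key.lower():
                yield path
            # Keep descending even when the key itself matched
            yield from find_tags(child, needle, path)
    elif isinstance(data, list):
        # Repeated elements share the path of their parent tag
        for child in data:
            yield from find_tags(child, needle, prefix)


# Hypothetical nested data, loosely modelled on the output above
record = {"Obj": {"TitleAtt": {"Title": "Example"}},
          "DFS": {"PubFrosting": {"Title": "The Australian",
                                  "SourceType": "Newspapers"}}}

paths = list(find_tags(record, "title"))
print(paths)
```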

Search the whole document for string values that contain a given string (again case-insensitive in the example below):

In [11]:
pq_xml.search_all_values('news')

[('/Obj/SourceRollupType', 'Newspapers'),
 ('/Obj/ObjectTypes/other/#text', 'News'),
 ('/Obj/ObjectTypes/mstar', 'News'),
 ('/Obj/Copyright/CopyrightData',
  '(c) News Limited Australia. All rights reserved.'),
 ('/DFS/PubFrosting/SourceType', 'Newspapers'),
 ('/DFS/PubFrosting/publisher/PublisherName', 'News Limited'),
 ('/DFS/PubFrosting/publisher/URL', 'www.NewsCorpAustralia.com')]
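
The value search can be sketched in the same recursive style, this time testing the leaves instead of the keys. As before, this is an illustration under the assumption of case-insensitive substring matching; `find_values` and the sample `record` are hypothetical.

```python
def find_values(data, needle, prefix=""):
    """Yield (path, value) for every string leaf that contains `needle`, ignoring case."""
    if isinstance(data, dict):
        for key, child in data.items():
            yield from find_values(child, needle, f"{prefix}/{key}")
    elif isinstance(data, list):
        for child in data:
            yield from find_values(child, needle, prefix)
    elif isinstance(data, str) and needle.lower() in data.lower():
        yield (prefix, data)


# Hypothetical nested data, loosely modelled on the output above
record = {"Obj": {"SourceRollupType": "Newspapers",
                  "ObjectTypes": {"mstar": "News"}},
          "DFS": {"PubFrosting": {"Title": "The Australian"}}}

hits = list(find_values(record, "news"))
print(hits)
```

Non-string leaves are skipped, which mirrors the "(string) value" wording of the method's description.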

Get the main article text with HTML tags removed:

In [12]:
print(pq_xml.get_text()[:110])

 THEY are Australia's most expensive refugees. The Rahavans, a Sri Lankan family of five, most of whom journey
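
One common way to strip HTML tags with only the standard library is a small `html.parser.HTMLParser` subclass. How `get_text` does it internally is not shown here, so treat this as an independent sketch; `TextExtractor`, `strip_tags` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_endtag(self, tag):
        # Keep block-level boundaries from gluing words together
        if tag in {"p", "div"}:
            self.chunks.append(" ")


def strip_tags(html):
    """Return the text of an HTML fragment with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Made-up markup for illustration
sample = "<p>THEY are <b>Australia's</b> most expensive refugees.</p><p>Second paragraph.</p>"
print(strip_tags(sample))
```

Note that this version also collapses runs of whitespace, so it would drop a leading space like the one in the output above.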


See the full nested structure of the tags with `pq_xml.show_all_tags()`, e.g.:

```
GOID
Obj
  SourceRollupType
  ObjectTypes
    other
      @ObjectTypeOrigin
      #text
    mstar
  ObjectRollupType
  TitleAtt
    Title
```
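
An indented listing like this can be produced from any nested dictionary with a few lines of recursion. The sketch below is not the `show_all_tags` implementation; `tag_tree` is a hypothetical helper, the sample values are placeholders, and showing only the first item of a repeated element is an assumption made to keep the listing short.

```python
def tag_tree(data, indent=0):
    """Return one line per tag, indented two spaces per nesting level."""
    lines = []
    if isinstance(data, dict):
        for key, child in data.items():
            lines.append("  " * indent + key)
            lines.extend(tag_tree(child, indent + 1))
    elif isinstance(data, list) and data:
        # Assumption: repeated elements share a structure, so show the first only
        lines.extend(tag_tree(data[0], indent))
    return lines


# Hypothetical nested data; the values are placeholders
record = {"GOID": "868650799",
          "Obj": {"SourceRollupType": "Newspapers",
                  "ObjectTypes": {"other": {"@ObjectTypeOrigin": "placeholder",
                                            "#text": "News"},
                                  "mstar": "News"}}}

print("\n".join(tag_tree(record)))
```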