In [1]:
from proquest_xml import ProquestXml, create_dataframe

## Read XML and get basic details

In [2]:
pq_xml = ProquestXml('data/hr_sample/868650799.xml')

print(pq_xml.id, pq_xml.get_article_title())

868650799 Million-dollar refugee family caught in perpetual detention


## Read multiple documents and create a dataframe

In [3]:
import glob

filenames = glob.glob('data/hr_sample/*.xml')[:5]
docs = [ProquestXml(f) for f in filenames]
df = create_dataframe(docs)

df

| | id | title | date_published | publication | author_last_name | author_first_name | article_type | text |
|---|---|---|---|---|---|---|---|---|
| 0 | 879751809 | Army officer claims stutter discrimination - E... | 2011-07-29 | The Australian | Callinan | Rory | News | A STUTTERING army officer has lodged a discri... |
| 1 | 908016581 | Aboriginal call for intervention on health nig... | 2011-12-06 | The Australian | Puddy | Rebecca | News | HUNDREDS of different laws across the country... |
| 2 | 881503793 | Amnesty says policy forcing Aborigines off the... | 2011-08-09 | Sydney Morning Herald | | | News | GOVERNMENT policies are driving Australian Ab... |
| 3 | 914325855 | There will be women in foxholes | 2012-01-07 | Sydney Morning Herald | Snow | Deborah | News | "I was one of the first women to do the job I... |
| 4 | 904162894 | Afghan wins hearing into deportation | 2011-11-17 | Sydney Morning Herald | Needham | Kirsty | News | A CHALLENGE to the first forced deportation o... |


To include extra fields in the dataframe, pass `extra_fields`, a dictionary that maps each new column name to the field's path in the nested structure:

In [13]:
extra_df = create_dataframe(
    [pq_xml] + docs,  # pq_xml is added because some of the docs lack the subject field
    extra_fields={
        'language': 'Obj/Language',
        'terms': 'Obj/Terms/GenSubjTerm/GenSubjValue'
    })

extra_df

| | id | title | date_published | publication | author_last_name | author_first_name | article_type | text | language | terms |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 868650799 | Million-dollar refugee family caught in perpet... | 2011-05-27 | The Australian | Maley | Paul | News | THEY are Australia's most expensive refugees.... | [{'RawLang': 'English'}, {'@IsPrimary': 'true'... | Immigration policy |
| 1 | 879751809 | Army officer claims stutter discrimination - E... | 2011-07-29 | The Australian | Callinan | Rory | News | A STUTTERING army officer has lodged a discri... | [{'RawLang': 'English'}, {'@IsPrimary': 'true'... | |
| 2 | 908016581 | Aboriginal call for intervention on health nig... | 2011-12-06 | The Australian | Puddy | Rebecca | News | HUNDREDS of different laws across the country... | [{'RawLang': 'English'}, {'@IsPrimary': 'true'... | |
| 3 | 881503793 | Amnesty says policy forcing Aborigines off the... | 2011-08-09 | Sydney Morning Herald | | | News | GOVERNMENT policies are driving Australian Ab... | [{'RawLang': 'English'}, {'@IsPrimary': 'true'... | |
| 4 | 914325855 | There will be women in foxholes | 2012-01-07 | Sydney Morning Herald | Snow | Deborah | News | "I was one of the first women to do the job I... | [{'RawLang': 'English'}, {'@IsPrimary': 'true'... | |
| 5 | 904162894 | Afghan wins hearing into deportation | 2011-11-17 | Sydney Morning Herald | Needham | Kirsty | News | A CHALLENGE to the first forced deportation o... | [{'RawLang': 'English'}, {'@IsPrimary': 'true'... | |
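
Each path in `extra_fields` can be read as a walk through nested dictionaries, one key per `/`-separated segment. The sketch below illustrates that idea with plain dictionaries only. `resolve_path` and `extract_fields` are hypothetical helpers, not part of `proquest_xml`, and the two sample documents are made up to mirror the shape of the data above. It also shows one way a missing field could end up as an empty cell, as in the `terms` column.

```python
def resolve_path(doc, path, default=None):
    """Follow a '/'-separated path through nested dicts.

    Returns `default` as soon as a segment is missing.
    """
    node = doc
    for key in path.split('/'):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def extract_fields(doc, extra_fields):
    """Build one row: column name -> value at the mapped path (None when absent)."""
    return {column: resolve_path(doc, path) for column, path in extra_fields.items()}


# Made-up documents, shaped like the nested structure used in this notebook.
with_terms = {'Obj': {'Terms': {'GenSubjTerm': {'GenSubjValue': 'Immigration policy'}}}}
without_terms = {'Obj': {}}

fields = {'terms': 'Obj/Terms/GenSubjTerm/GenSubjValue'}
rows = [extract_fields(doc, fields) for doc in (with_terms, without_terms)]
print(rows)
```

Returning a default instead of raising keeps one row per document even when a field is absent, which seems to match the behaviour visible in the table above, though the library may handle this differently.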


## Other features

A `ProquestXml` object supports dictionary-style item lookup:

In [5]:
pq_xml['GOID']

'868650799'

The `get()` and `search()` methods use the `dpath` module to reach into the nested dictionaries by path and to run basic glob-style searches:

In [10]:
pq_xml.get('DFS/PubFrosting/Title')

'The Australian'
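
As a rough illustration of what glob-style searching over nested dictionaries involves, here is a standard-library sketch. It is not how `dpath` is implemented, and `search_paths` is a hypothetical helper. It matches each path segment against a pattern with `fnmatch` and, for simplicity, returns a flat mapping of matching paths rather than a nested dictionary. The sample data is a made-up fragment of the structure.

```python
from fnmatch import fnmatchcase


def search_paths(doc, pattern):
    """Return {path: value} for every path whose segments match the glob pattern."""
    segments = pattern.split('/')
    matches = {}

    def walk(node, depth, trail):
        if depth == len(segments):
            # Every pattern segment matched: record the full path.
            matches['/'.join(trail)] = node
            return
        if not isinstance(node, dict):
            return  # The pattern goes deeper than the data does.
        for key, value in node.items():
            if fnmatchcase(key, segments[depth]):
                walk(value, depth + 1, trail + [key])

    walk(doc, 0, [])
    return matches


# Made-up fragment for illustration only.
sample = {'DFS': {'PubFrosting': {'Title': 'The Australian',
                                  'SortTitle': 'Australian The',
                                  'Qualifier': 'Canberra, A.C.T.'}}}

found = search_paths(sample, 'DFS/PubFrosting/*Title*')
print(found)
```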

In [11]:
pq_xml.search('DFS/PubFrosting/*Title*')

{'DFS': OrderedDict([('PubFrosting',
               OrderedDict([('Title', 'The Australian'),
                            ('SortTitle', 'Australian The'),
                            ('CurrentTitle',
                             OrderedDict([('Title', 'The Australian'),
                                          ('SortTitle', 'Australian The'),
                                          ('Qualifier', 'Canberra, A.C.T.'),
                                          ('EndIssueDate', '99991231'),
                                          ('Locators',
                                           OrderedDict([('Locator',
                                                         [OrderedDict([('@Type',
                                                                        'PQPMID'),
                                                                       ('Name',
                                                                        '55714')]),
                                                      

Get the main article text with HTML tags removed:

In [17]:
print(pq_xml.get_text()[:110])

 THEY are Australia's most expensive refugees. The Rahavans, a Sri Lankan family of five, most of whom journey
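
How `get_text()` removes the tags is not shown here. One common way to strip HTML with only the standard library is to subclass `html.parser.HTMLParser` and keep just the text nodes. The sketch below assumes nothing about the library's internals; `TagStripper` and `strip_tags` are hypothetical names, and the markup is made up.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only text content, discarding the tags around it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def strip_tags(html):
    """Return the text of an HTML fragment with all tags removed."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return ''.join(stripper.parts)


# Made-up markup for illustration only.
raw = "<p>THEY are <b>Australia's</b> most expensive refugees.</p>"
clean = strip_tags(raw)
print(clean)
```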


Print the full nested key structure with `show_all_keys()`. For example:

```
GOID
Obj
  SourceRollupType
  ObjectTypes
    other
      @ObjectTypeOrigin
      #text
    mstar
  ObjectRollupType
  TitleAtt
    Title
```
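
An indented key listing like the one above can be produced by a short recursive walk over nested dictionaries. The following is a minimal sketch of that idea, not the library's `show_all_keys()` implementation. `format_keys` is a hypothetical helper, the sample values are placeholders, and lists of repeated elements are not handled.

```python
def format_keys(node, indent=0):
    """Return one line per key, indented two spaces per nesting level."""
    lines = []
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append(' ' * indent + key)
            # Recurse into children; non-dict values contribute no lines.
            lines.extend(format_keys(value, indent + 2))
    return lines


# Placeholder values; only the key layout mirrors the listing above.
sample = {
    'GOID': '868650799',
    'Obj': {
        'SourceRollupType': '',
        'ObjectTypes': {
            'other': {'@ObjectTypeOrigin': '', '#text': ''},
            'mstar': '',
        },
    },
}

tree = '\n'.join(format_keys(sample))
print(tree)
```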