In [1]:
import pandas as pd
import numpy as np
import glob
import nltk

from proquest_xml import ProquestXml, create_dataframe, filter_company_reports, enter_query, concordance_dataframe

### Read in XML corpora and create a dictionary

In [5]:
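# Collect every XML file in the corpus folder (replace the placeholder path)
filenames = glob.glob('path/to/file/*.xml')

# Parse each file and key the resulting documents by their id
docs = {doc.id: doc for doc in 
        (ProquestXml(f) for f in filenames)}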

### Filter out 'Company Data Report' XML files

In [6]:
docs = filter_company_reports(docs)

### Create dataframe from filtered dictionary

In [7]:
df = create_dataframe(docs.values())
df.head(5)

Unnamed: 0,id,title,date_published,publication,author1_last_name,author1_first_name,author1_full_name,other_authors,article_type,text
0,2390914166,"Osceola anuncia plan de recuperación, nuevo si...",2020-04-17,TCA Regional News,Cotto,Ingrid,Ingrid Cotto,[],News,\nApril 17--El gobierno de Osceola intenta vol...
1,2390939976,Virtual GRADUCon discussions examine impact of...,2020-04-16,University Wire,Tessel,Michael,Michael Tessel,[],News,"\nPublication: The Pulse, , Finch University o..."
2,2390931455,Se posterga Asamblea General del Patronato de ...,2020-04-17,NOTIMEX,,,,[],News,"\nPor Guillermo Abogado González\n\nMéxico, 17..."
3,2390919681,Cierran por tercer día consecutivo primer cuad...,2020-04-17,NOTIMEX,,,,[],News,"\nPlaya del Carmen, 17 Abr (Notimex).- La Secr..."
4,2390914548,Andhra govt politicising COVID-19 situation : ...,2020-04-18,Asian News International,,,,[],News,"\n\nHyderabad (Telangana) [India], April 18 (A..."


### Concordance tool

There are two steps for creating a concordance of the corpus above.

1. Specify the words to search for with enter_query(). At the prompt, type the words separated by commas, e.g. rights,liberties,freedoms. Multi-word phrases are also accepted; separate the words within a phrase with spaces, e.g. rights,human rights,bill of rights,liberties. Press Enter when you are satisfied with the entries, and Python stores them as a list.

2. Apply the list to the corpus with concordance_dataframe() to generate a dataframe containing each match of a query term together with the words to its left and right.
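
To make the second step concrete, here is a minimal keyword-in-context sketch in plain Python. It is not the implementation inside proquest_xml: the names tokenize and concordance, the regex tokenizer, and the width parameter are all assumptions made for illustration. It only shows the general idea of pairing each match with the tokens to its left and right.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def concordance(text, query, width=5):
    """Return (left, match, right) tuples for every occurrence of `query`.

    `query` may be one word or a space-separated phrase; matching is
    case-insensitive. `width` is the number of context tokens per side.
    """
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    target = query.lower().split()
    n = len(target)
    hits = []
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == target:
            left = " ".join(tokens[max(0, i - width):i])
            match = " ".join(tokens[i:i + n])
            right = " ".join(tokens[i + n:i + n + width])
            hits.append((left, match, right))
    return hits

sample = "All rights reserved. Human rights groups defend civil liberties."
for row in concordance(sample, "human rights", width=3):
    print(row)
```

Because the phrase is compared token by token, a multi-word entry such as human rights only matches when its words appear next to each other in the text.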

In [8]:
# Create a list of words

word_list = enter_query()

Please enter the words you wish to query separated by commas (e.g. rights,freedoms,liberties). If you wish to query multiple words at the same time, separate them using a space (e.g. freedoms,human rights,liberties)
rights,human rights,freedom,freedoms,liberties


In [9]:
# Check that the list has been correctly generated

word_list

['rights', 'human rights', 'freedom', 'freedoms', 'liberties']
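
The list above is what you would get by splitting the typed input on commas. The sketch below shows one way to do that without an interactive prompt, which can be handy when re-running the notebook. The function name parse_query is hypothetical, and enter_query() may handle whitespace or empty entries differently.

```python
def parse_query(raw):
    # Split on commas, trim surrounding whitespace, and drop empty entries.
    # Spaces inside an entry are kept, so multi-word phrases stay intact.
    return [term.strip() for term in raw.split(",") if term.strip()]

words = parse_query("rights,human rights,freedom,freedoms,liberties")
print(words)
```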

In [10]:
# Query the dataframe of the XML corpus with the list

dataframe = concordance_dataframe(df, word_list)

In [11]:
dataframe

Unnamed: 0,id,title,date_published,publication,author1_last_name,author1_first_name,other_authors,article_type,left,query,right
20,2390914548,Andhra govt politicising COVID-19 situation : ...,2020-04-18,Asian News International,,,[],News,irregularities in the distribution of grocerie...,rights,reserved . Provided by SyndiGate Media Inc. ( ...
90,2390914605,AllianceBernstein Holding climbs 4.8%,2020-04-17,News Bites US - NYSE,,,[],News,to read in the press an inaccurate and unfound...,rights,"of the Company and its management , despite th..."
95,2390914188,"In regular touch with centre, states to bring ...",2020-04-18,Asian News International,,,[],News,"from 4,000 tests per day to 10,000 tests per d...",rights,reserved . Provided by SyndiGate Media Inc. ( ...
190,2390914102,Combating COVID-19: Laboratories should speed ...,2020-04-18,Asian News International,,,[],News,"Pradesh , including 74 cured and discharged an...",rights,reserved . Provided by SyndiGate Media Inc. ( ...
225,2390914275,"Champagne, Ukraine reject Iran crash report su...",2020-04-16,National Post (Online),,,[],News,to sign a memorandum of understanding in which...,rights,for legal compensation . Iran 's Revolutionary...
226,2390914275,"Champagne, Ukraine reject Iran crash report su...",2020-04-16,National Post (Online),,,[],News,"that Canada , Ukraine or any of the other coun...",rights,to hold Iran to account . `` That would not be...
241,2390914075,Arbor Realty Trust soars 11.7% on average volume,2020-04-17,News Bites US - NYSE,,,[],News,compared to $ 14.2 million and 1.54 % for the ...,rights,"was $ 29.9 million for the quarter , reflectin..."
251,2390914060,Ashland Global Holdings up 6.2% in 2 days,2020-04-17,News Bites US - NYSE,,,[],News,"Tender Offers may no longer be withdrawn , exc...",rights,are required by law . January 22 : Ashland ann...
281,2390914737,"Brazil's coronavirus death toll surpass 2,000",2020-04-18,Asian News International,,,[],News,"distancing , resulting in his removal from off...",rights,reserved . Provided by SyndiGate Media Inc. ( ...
301,2390914681,Annaly Capital Management climbs 5.5%,2020-04-17,News Bites US - NYSE,,,[],News,". On the Redemption Date , dividends on the Se...",rights,relating to the Series C Preferred Stock will ...


### Export dataframe to Excel file

In [16]:
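# Write the concordance to an Excel workbook
# (pandas needs an Excel writer such as openpyxl installed for .xlsx files)
dataframe.to_excel("your_file_name.xlsx")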