# Tasks 

Get bibliography of authors with at least 10 publications from an archive of papers. You need to use Dask python library, that provides multi-variables through Dask.bag.

### Source
Data file is 4.2GB size, get from here: https://arcadauas-my.sharepoint.com/:u:/g/personal/akusokan_arcada_fi1/EYyh_LCY_1hBm6bZRn7MzpABAHmzUJO0su0jQZAOi8nB1w?e=bJISjG

 

Or from BDA lab: bdalab@bdalab-248:/home/data/articles.C-H.xml.tar.gz

Jupyter on BDA lab: 

./anaconda3/bin/jupyter notebook --no-browser
 

1. Create a multi-variable of texts by reading compressed TAR file directly
2. Extract author names for each paper
3. Make a new multi-variable with author name as a key and paper as a value
4. Make a new multi-variable with number of publications for each author (look for a suitable function in Dask.Bag module!)
5. Filter only authors with at least 10 publications
6. Get a list of those authors with their publication counts as your final result (a normal list, not a multi-variable)

In [2]:
# Packages
import dask
from dask import bag
import tarfile
from lxml import etree

In [3]:
path= "C:/Users/Shohag/Desktop/phython/articles.C-H.xml.tar.gz"

In [4]:
# Reader : Compressed TAR file

def get_one_author(element):
    name = element.find(".//given-names").text
    surname = element.find(".//surname").text
    return name+" "+surname
    
def reader(N=100):
    """Reads all authors from at most N files.
    """
    i = 0
    with tarfile.open(path, mode="r:gz") as tf:
        for item in tf:
            file = tf.extractfile(item)
            if file is None:
                continue   # was a folder
            try:    
                doc = etree.parse(file)
                authors_list = [get_one_author(a) for a in doc.findall(".//contrib") if "author" in a.values()]
            
                for author in authors_list:
                    
                    yield (item.name, author)
            except:
                continue

            i += 1
            if i >= N:
                break

In [5]:
MV = bag.from_sequence(reader(50000))

# Multivariable with author as a key and paper as a value
def get_author_name_paper(x):
    paper, key = x
    _, paper = paper.split(r'/')
    paper, _ = paper.split(r'.nxml')
    return (key, paper)

In [6]:
MV2 = MV.map(get_author_name_paper)
MV2.take(1)[0]

('Thomas L. Reichmann', 'CALPHAD_2014_Dec_47_56-62')

In [7]:
def make_dic(tup):
    tups = {'key': tup[0], 'value':tup[1]}
    return tups 

In [8]:
MV3 = MV2.map(make_dic)
MV3.take(1)[0]

{'key': 'Thomas L. Reichmann', 'value': 'CALPHAD_2014_Dec_47_56-62'}

In [9]:
#List of authors of at least 10 publications and number of their publications
def add(x, y = 1):
    return x + 1
MV4 = MV3.foldby('key', add, 0)
MV4.take(1)

(('Thomas L. Reichmann', 2),)

In [10]:
authors_10= list(MV4.filter(lambda x: x[1] >=10))
authors_10

[('Kimberly D. Tanner', 15),
 ('Louisa A. Stark', 10),
 ('Erin L. Dolan', 16),
 ('Deborah Allen', 10),
 ('Bruce F. Walker', 19),
 ('Robert W. Derlet', 11),
 ('David B. Hogan', 10),
 ('Kenneth Rockwood', 12),
 ('R. Enns', 12),
 ('Louis Valiquette', 12),
 ('Kevin B Laupland', 10),
 ('Adeera Levin', 11),
 ('Jason Nickerson', 13),
 ('Paul Dent', 19),
 ('Yangqiu Li', 10),
 ('R H Reznek', 13),
 ('Marij J. P. Welters', 11),
 ('Sjoerd H. van der Burg', 11),
 ('Simon Sherman', 10),
 ('Eun-Cheol Park', 10),
 ('Kyu-Won Jung', 13),
 ('Hyun-Joo Kong', 11),
 ('Young-Joo Won', 13),
 ('Tae Min Kim', 12),
 ('Peter Bramlage', 10),
 ('Enrique Z Fisman', 13),
 ('Alexander Tenenbaum', 16),
 ('Jim A. Reekers', 11),
 ('Bongani M Mayosi', 16),
 ('Karen Sliwa', 21),
 ('J Moodley', 10),
 ('Okechukwu S Ogah', 11),
 ('Simon Stewart', 10),
 ('IB Diop', 10),
 ('--- ---', 12),
 ('Naoki Oiso', 10),
 ('Akira Kawada', 10),
 ('Sadanori Furudate', 22),
 ('Akira Hashimoto', 10),
 ('Yumi Kambayashi', 19),
 ('Aya Kakizaki',