Hadoop MapReduce implementation for describing authors' publications and their relationships with other authors (co-authors).

Mohammed-siddiq/Authors-Relationship

Overview

Created a map/reduce program for parallel processing of the publicly available DBLP dataset, which contains entries for publications at many different venues (e.g., conferences and journals).

Each entry in the dataset describes a publication and contains the list of authors, the title, the publication venue, and a few other attributes. The file is approximately 2.5 GB.

Please refer to this short video for a demonstration of the project running on EMR.

Graph generated to represent all UIC CS faculty researchers and their work:

Graph of UIC CS faculty Researchers

(Pro tip: download and zoom in for the details.)

Consider the following entry in the dataset.

<inproceedings mdate="2017-05-24" key="conf/icst/GrechanikHB13">
<author>Mark Grechanik</author>
<author>B. M. Mainul Hossain</author>
<author>Ugo Buy</author>
<title>Testing Database-Centric Applications for Causes of Database Deadlocks.</title>
<pages>174-183</pages>
<year>2013</year>
<booktitle>ICST</booktitle>
<ee>https://doi.org/10.1109/ICST.2013.19</ee>
<ee>http://doi.ieeecomputersociety.org/10.1109/ICST.2013.19</ee>
<crossref>conf/icst/2013</crossref>
<url>db/conf/icst/icst2013.html#GrechanikHB13</url>
</inproceedings>

This entry lists a paper published in 2013 at the IEEE International Conference on Software Testing, Verification and Validation (ICST). One of its authors is my former Ph.D. student at UIC, Dr. B. M. Mainul Hossain, now a tenured Associate Professor at the University of Dhaka, whose advisor, Mark Grechanik, is a co-author on this paper. The third co-author is Prof. Ugo Buy, a faculty member at our CS department. The presence of two authors, Mark Grechanik and Ugo Buy, in a single publication like this one establishes a connection between these faculty members. Your job is to create a "friendship" connectivity graph between UIC CS faculty members using the information extracted from this dataset. Partitioning this dataset into shards is easy, since it only requires preserving the well-formedness of the XML. Most likely, you will write a simple program to partition the dataset into shards of approximately equal size.
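The partitioning step described above could look roughly like the following plain-Python sketch. It is not the project's code. It assumes every record's closing tag ends a line, as in the sample entry, and it ignores the XML declaration and root element, so each shard would still need wrapping to be a standalone document. The names `shard_lines` and `max_chars` are hypothetical.

```python
from typing import Iterable, Iterator, List

# Closing tags of the DBLP record types this README handles.
END_TAGS = ("</article>", "</inproceedings>", "</proceedings>", "</book>",
            "</incollection>", "</phdthesis>", "</mastersthesis>")


def shard_lines(lines: Iterable[str], max_chars: int) -> Iterator[List[str]]:
    """Group lines into shards, cutting only right after a record's closing tag."""
    shard: List[str] = []
    size = 0
    for line in lines:
        shard.append(line)
        size += len(line)
        # Cut only at a record boundary so no record is split across shards.
        if size >= max_chars and line.strip().endswith(END_TAGS):
            yield shard
            shard, size = [], 0
    if shard:
        # Whatever is left after the last cut forms the final, smaller shard.
        yield shard
```

Because a shard is closed only once it has reached `max_chars` and the current record has ended, shards come out roughly equal in size while every record stays intact.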

After this map/reduce program was created and tested locally, the job was deployed and run on Amazon Elastic MapReduce (EMR).

Steps to run on a VM:

  • Run sbt clean assembly, which runs the test cases and generates the JAR.
  • Copy the JAR to the machine. (The machine needs Java 1.8.)
  • Run the following command:
hadoop jar Mohammed_Siddiq_HW3-assembly-0.1.jar RelateAuthors.JobRunner path/to/input/file path/to/output/file

Steps to run on Cloud:

Solution Overview

  • Implemented XmlInputFormat, which implements the DataRecordReader to generate key-value pairs based on multiple start tags and end tags. The start and end tags considered are:

START_TAGS

<article ,<inproceedings ,<proceedings ,<book ,<incollection ,<phdthesis ,<mastersthesis

END_TAGS

</article>, </inproceedings>, </proceedings>, </book>, </incollection>, </phdthesis>, </mastersthesis>
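
To illustrate splitting on several start and end tags, here is a minimal plain-Python sketch. It is not the project's XmlInputFormat: a Hadoop record reader streams over an input split, whereas this works on a string already in memory. `extract_records` and `RECORD_TYPES` are hypothetical names.

```python
import re
from typing import Iterator

# Record types listed above. Opening tags are matched without requiring '>'
# because they may carry attributes such as mdate and key.
RECORD_TYPES = ("article", "inproceedings", "proceedings", "book",
                "incollection", "phdthesis", "mastersthesis")

# "<type" followed by whitespace or '>', then the shortest span up to "</type>".
# Requiring whitespace or '>' keeps "<book" from matching "<booktitle>".
RECORD_RE = re.compile(
    r"<(%s)[\s>].*?</\1>" % "|".join(RECORD_TYPES),
    re.DOTALL,
)


def extract_records(text: str) -> Iterator[str]:
    """Yield each publication record found in text as one XML string."""
    for match in RECORD_RE.finditer(text):
        yield match.group(0)
```

Each yielded string corresponds to one value that would be handed to the mapper.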
  • The Mapper :

    • Processes the individual papers/publications provided as XML by the InputFormat.
    • Extracts the authors and keeps only the UIC CS faculty authors.
    • Emits each individual author as a key with 1 as the value.
    • Also emits each pair of co-authors as a key with 1 as the value.
    • Together, these records capture each author's publications and their collaborations with co-authors from the UIC CS faculty.

    For example:

    Suppose (a1,a2,a3) are the authors of a paper; then the mapper emits:

      (a1)->1,(a2)->1,(a3)->1,(a1,a2)->1,(a1,a3)->1,(a2,a3)->1
    
  • The Combiner and Reducer :

    • The combiner and reducer add up the values for each key, thereby summing the number of publications of each individual author and each co-author pair.
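
The mapper and reducer logic above can be sketched in plain Python as follows. This is an illustration under assumptions, not the project's Hadoop code: the `faculty` set is a hypothetical input, and lowercasing names and sorting each pair alphabetically are assumptions chosen to be consistent with the sample output in this README.

```python
import re
from collections import Counter
from itertools import combinations
from typing import Iterable, Iterator, Set, Tuple

AUTHOR_RE = re.compile(r"<author[^>]*>(.*?)</author>", re.DOTALL)


def map_record(record: str, faculty: Set[str]) -> Iterator[Tuple[str, int]]:
    """Emit (author, 1) and (pair, 1) for the faculty authors of one record."""
    # Assumption: names are lowercased and non-faculty authors are dropped.
    authors = sorted({a.strip().lower() for a in AUTHOR_RE.findall(record)} & faculty)
    for author in authors:
        yield author, 1
    # Sorting first gives each pair one canonical key regardless of author order.
    for first, second in combinations(authors, 2):
        yield first + "," + second, 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Sum the values per key, which is what both combiner and reducer do."""
    totals: Counter = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals
```

Since addition is associative and commutative, the same summing step can run as a combiner on each mapper's partial output without changing the final totals.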

A sample of the job's final (reduced) output looks like this:

  a. prasad sistla	130
  a. prasad sistla,isabel f. cruz	2
  a. prasad sistla,lenore d. zuck	7
  a. prasad sistla,ouri wolfson	22
  a. prasad sistla,robert h. sloan	1
  a. prasad sistla,v. n. venkatakrishnan	8
  ajay d. kshemkalyani	113
  ajay d. kshemkalyani,ugo buy	1
  anastasios sidiropoulos	101
  anastasios sidiropoulos,bhaskar dasgupta	1
  andrew e. johnson	102
  andrew e. johnson,barbara di eugenio	2
  andrew e. johnson,luc renambot	26
  andrew e. johnson,tanya y. bergerwolf	1
  balajee vamanan	15
  barbara di eugenio	116
  barbara di eugenio,bing liu	1
  barbara di eugenio,brian d. ziebart	1
  barbara di eugenio,isabel f. cruz	2
  barbara di eugenio,luc renambot	1
  barbara di eugenio,ouri wolfson	2
  barbara di eugenio,peter c. nelson	3
  bhaskar dasgupta	142
  bhaskar dasgupta,nasim mobasheri	9
  bhaskar dasgupta,ouri wolfson	5
  bhaskar dasgupta,robert h. sloan	2
  bhaskar dasgupta,tanya y. bergerwolf	12

The final output directory of the reducers is given to the graph visualization tool, built on Graphviz, to generate a PNG image representing the friendship graph between professors. Individual nodes represent the CS faculty members, and the edges between them represent friendships (co-authorship).

The weights associated with the edges represent the number of times the two authors published together. The weight of each node represents the author's total number of publications.
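
As a rough illustration of how the reducer output maps onto a graph, the following plain-Python sketch converts output lines into Graphviz DOT text. It is not the linked implementation. It assumes each line is a key and a count separated by a tab, and that author names contain no commas, so any key with a comma is a co-author pair. `to_dot` is a hypothetical name.

```python
from typing import Iterable


def to_dot(lines: Iterable[str]) -> str:
    """Turn 'key<TAB>count' lines into an undirected Graphviz DOT graph."""
    nodes, edges = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, count = line.rsplit("\t", 1)
        if "," in key:
            # A comma-separated key is a co-author pair: a weighted edge.
            first, second = key.split(",", 1)
            edges.append('  "%s" -- "%s" [label="%s"];' % (first, second, count))
        else:
            # A single-name key is an author: the count goes into the node label.
            nodes.append('  "%s" [label="%s (%s)"];' % (key, key, count))
    return "graph coauthors {\n" + "\n".join(nodes + edges) + "\n}\n"
```

The resulting text can be saved to a file and rendered with the Graphviz command-line tool, for example `dot -Tpng graph.dot -o graph.png`.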

Link to GraphViz implementation to generate the DOT file from the output and present the graph
