Nikhil Das Karavatt
This report documents the analysis of a large DBLP dataset, focusing on authors, references, and venues. The project explores the applications and challenges of Python and its libraries for graph analysis and visualization.
The project involved uploading the DBLP dataset to Google Drive, reading the data in Google Colab in chunks, preprocessing the data, and performing graph analysis. Three main graphs were created:
- Known Author Graph: An undirected graph connecting authors related to each other.
- Paper Citation Graph: A directed graph connecting paper IDs with their references.
- Author Venue Graph: An undirected graph connecting authors to the venues where they presented their papers.
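The three graphs could be assembled with NetworkX roughly as follows. This is a minimal sketch, not the project's actual code: the record fields (`id`, `authors`, `references`, `venue`) and the toy records are assumptions, and "related" authors are taken here to mean co-authors of the same paper.

```python
from itertools import combinations

import networkx as nx

# Hypothetical records; the field names are assumptions, not the DBLP schema.
papers = [
    {"id": "P1", "authors": ["A1", "A2"], "references": ["P2", "P3"], "venue": "V1"},
    {"id": "P2", "authors": ["A2", "A3", "A2"], "references": ["P3"], "venue": "V2"},
    {"id": "P3", "authors": ["A1"], "references": [], "venue": "V1"},
]


def build_graphs(records):
    known_author = nx.Graph()      # undirected: author -- author
    paper_citation = nx.DiGraph()  # directed: paper -> referenced paper
    author_venue = nx.Graph()      # undirected: author -- venue

    for rec in records:
        # De-duplicate authors so combinations() never pairs an author with itself.
        authors = sorted(set(rec["authors"]))
        known_author.add_edges_from(combinations(authors, 2))
        paper_citation.add_edges_from((rec["id"], ref) for ref in rec["references"])
        author_venue.add_edges_from((a, rec["venue"]) for a in authors)

    return known_author, paper_citation, author_venue


known_author, paper_citation, author_venue = build_graphs(papers)
```

In this sketch, author and venue names share one node namespace in the Author Venue Graph, so real data would need identifiers that cannot collide.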
The project was completed individually, taking approximately 60 hours of work to finish.
- Self-loop Issue: While creating edges for the Known Author Graph, self-loops such as (A1, A1) were produced. This was fixed by refining the edge creation code and using a predefined function to remove self-loops.
- Graph Layout: Overlapping elements in the graph layout were resolved by adjusting layout parameters to improve readability.
- Data Cleaning: Cleaning data with NaN values and empty sets required careful handling to avoid dropping rows unnecessarily.
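The data-cleaning and self-loop fixes might look like the sketch below. It assumes pandas and a CSV export with hypothetical `id`, `authors`, and `references` columns, since the report does not state the file format or the libraries used for loading. The "predefined function" is assumed to be NetworkX's `selfloop_edges`.

```python
import io

import networkx as nx
import pandas as pd

# Stand-in for the real file; the column names are assumptions.
raw = io.StringIO(
    "id,authors,references\n"
    "P1,A1;A2,P2\n"
    "P2,,P1\n"   # missing authors: unusable for the author graph
    "P3,A3,\n"   # missing references: still usable for the author graph
)


def read_clean_chunks(source, chunksize=2):
    """Yield cleaned chunks, dropping only rows that lack required columns."""
    for chunk in pd.read_csv(source, chunksize=chunksize, dtype=str):
        # Restricting dropna to a subset avoids discarding rows over unrelated NaNs.
        chunk = chunk.dropna(subset=["id", "authors"])
        # A missing reference list becomes an empty string instead of a dropped row.
        chunk = chunk.fillna({"references": ""})
        yield chunk


cleaned = pd.concat(list(read_clean_chunks(raw)), ignore_index=True)

# Removing self-loops such as (A1, A1) after the edges have been added.
g = nx.Graph([("A1", "A1"), ("A1", "A2")])
g.remove_edges_from(list(nx.selfloop_edges(g)))
```

Materialising `selfloop_edges` into a list before removal avoids modifying the graph while iterating over it.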
A sample size of 20,000 was used to calculate the following characteristics of the Known Author Graph, Paper Citation Graph, and Author Venue Graph:
Known Author Graph:
- Number of Nodes: 50,436
- Number of Edges: 90,153
- Density: 7.0882e-05
- Number of Connected Components: 11,064
- Diameter: -1 (not defined, because the graph has more than one connected component)
- Minimum Degree: 1
- Maximum Degree: 53
- Average Degree: 3.5749
- Std Dev of Degree: 3.3221
Paper Citation Graph:
- Number of Nodes: 133,329
- Number of Edges: 155,457
- Density: 1.7490e-05
- Number of Connected Components: 3,613
- Diameter: -1 (not defined, because the graph has more than one connected component)
- Minimum Degree: 1
- Maximum Degree: 153
- Average Degree: 2.3319
- Std Dev of Degree: 4.3629
Author Venue Graph:
- Number of Nodes: 8,805
- Number of Edges: 6,646
- Density: 1.715e-04
- Number of Connected Components: 2,168
- Diameter: -1 (not defined, because the graph has more than one connected component)
- Minimum Degree: 1
- Maximum Degree: 19
- Average Degree: 1.5096
- Std Dev of Degree: 1.1879
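Characteristics like those listed above could be computed with NetworkX along these lines. This is a sketch on a toy graph, not the project's code, and the helper name `graph_characteristics` is hypothetical. It assumes that a diameter of -1 is a sentinel for a disconnected graph, that directed graphs are analysed as undirected (which is consistent with the reported citation-graph density and average degree), and that the standard deviation is the population form; the report does not say which form was used.

```python
import statistics

import networkx as nx


def graph_characteristics(graph):
    """Return summary statistics; directed input is analysed as undirected."""
    g = graph.to_undirected() if graph.is_directed() else graph
    degrees = [d for _, d in g.degree()]
    connected = nx.is_connected(g)
    return {
        "nodes": g.number_of_nodes(),
        "edges": g.number_of_edges(),
        "density": nx.density(g),
        "components": nx.number_connected_components(g),
        # Diameter is undefined for a disconnected graph; -1 is a sentinel here.
        "diameter": nx.diameter(g) if connected else -1,
        "min_degree": min(degrees),
        "max_degree": max(degrees),
        "avg_degree": statistics.mean(degrees),
        "std_degree": statistics.pstdev(degrees),
    }


# Toy graph with two components: a path A1-A2-A3 and a pair A4-A5.
toy = nx.Graph([("A1", "A2"), ("A2", "A3"), ("A4", "A5")])
stats = graph_characteristics(toy)
```

Note that `to_undirected` merges reciprocal directed edges, so the edge count of a converted citation graph can be lower than that of the directed original.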
Manually calculated graph characteristics were compared with the coded results for the Known Author Graph, Paper Citation Graph, and Author Venue Graph.
The most common clique size in the Known Author Graph was identified.
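One way to find the most common clique size is to count the sizes of all maximal cliques. The sketch below assumes that "clique" means maximal clique and uses a toy graph; the report does not say which definition or function was used.

```python
from collections import Counter

import networkx as nx

g = nx.Graph()
g.add_edges_from([("A1", "A2"), ("A2", "A3"), ("A1", "A3")])  # one triangle
g.add_edges_from([("A4", "A5"), ("A6", "A7"), ("A3", "A4")])  # three pairs

# find_cliques yields each maximal clique as a list of nodes.
size_counts = Counter(len(clique) for clique in nx.find_cliques(g))
most_common_size, how_many = size_counts.most_common(1)[0]
```

Enumerating maximal cliques can be expensive on dense graphs, although sparse co-authorship graphs like the one described here are usually manageable.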
The top 10 most cited papers in the Paper Citation Graph were identified.
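Ranking papers by citation count amounts to sorting nodes by in-degree. The following is a sketch under the assumption that edges point from the citing paper to the referenced paper; the helper `top_cited` and the toy edges are hypothetical.

```python
import networkx as nx

citations = nx.DiGraph()
citations.add_edges_from([
    ("P1", "P3"), ("P2", "P3"), ("P4", "P3"),
    ("P1", "P2"), ("P4", "P2"),
    ("P3", "P5"),
])


def top_cited(graph, k=10):
    # In-degree counts incoming edges, i.e. how many papers reference this one.
    return sorted(graph.in_degree(), key=lambda pair: pair[1], reverse=True)[:k]


top = top_cited(citations, k=2)
```

Calling `top_cited(citations)` with the default `k=10` would return the ten most cited papers, or all of them when the graph has fewer than ten nodes.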
The report includes ground truth information for the Known Author Graph and manual calculations for several graph characteristics.
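As an illustration of such a manual check, density and average degree can be recomputed from the reported node and edge counts alone, using the standard undirected formulas 2m/(n(n-1)) and 2m/n. This sketch is not taken from the report; the function names are hypothetical.

```python
# Node and edge counts as reported for the 20,000 sample: (nodes, edges).
reported = {
    "Known Author Graph": (50_436, 90_153),
    "Paper Citation Graph": (133_329, 155_457),
    "Author Venue Graph": (8_805, 6_646),
}


def manual_density(n, m):
    # Undirected simple graph: m edges out of n * (n - 1) / 2 possible.
    return 2 * m / (n * (n - 1))


def manual_average_degree(n, m):
    # Each edge contributes to the degree of two nodes.
    return 2 * m / n


manual = {
    name: (manual_density(n, m), manual_average_degree(n, m))
    for name, (n, m) in reported.items()
}
```

With these formulas the recomputed values agree with the reported densities and average degrees to the printed precision. This includes the Paper Citation Graph, which suggests its statistics were computed with the undirected formulas.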