# SC4022 - Group Project
## Network-Science Based Analysis of Collaboration Network of Data Scientists
### Contributors:
- Cholakov Kristiyan Kamenov
- Chua Wee Siang, Fraser
- Dhanyamraju Harsh Rao

## 0. Importing Libraries
First, we will import the necessary libraries for the whole project.

In [213]:
# Importing the libraries
import os  # Used for file operations
import requests  # Used for making HTTP requests
from bs4 import BeautifulSoup  # Used for parsing HTML
import re  # Used for regular expressions
import pandas as pd  # Used for data manipulation and analysis
import numpy as np  # Used for numerical computations
from tqdm import tqdm  # Used for progress bars
import time  # Used for time operations
import xml.etree.ElementTree as ET  # Used for parsing XML
from ipytree import Tree, Node  # Used for displaying the XML tag hierarchy in a dynamic tree-like structure
import json  # Used for JSON operations (used for the XML tag hierarchy)
import ast # Used for converting string to list

We will also set a random seed for reproducibility.

In [69]:
# Set a random seed for reproducibility
np.random.seed(42)

## 1. Data Collection
In this part, we collect data from https://dblp.org (the DBLP computer science bibliography) for the data scientists listed in the DataScientists.xls file. For standardization and ease of use, we first convert the .xls file to a .csv file.

In [70]:
# Check if the Excel file is converted to a csv file
if 'scientists_collab.csv' in os.listdir():
    print('The CSV file already exists')
else:
    print('The Excel file has not been converted to a CSV file')
    # Read the Excel file into a DataFrame
    df = pd.read_excel('DataScientists.xls')
    # Convert the DataFrame to a CSV file
    df.to_csv('scientists_collab.csv', index=False)
    print('The Excel file has been converted to a CSV file')

# Read the converted CSV file into a DataFrame
initial_df = pd.read_csv('scientists_collab.csv')

The CSV file already exists


### 1.1 Data Analysis on the Given Data
Before we start collecting the data, we will have a look at the given data to understand the structure of the data.

In [71]:
# Displaying the number of rows and columns in the DataFrame
print(f'The DataFrame has {initial_df.shape[0]} rows and {initial_df.shape[1]} columns')

# Displaying the first 5 rows of the DataFrame
initial_df.head(5)

The DataFrame has 1220 rows and 5 columns


Unnamed: 0,name,country,institution,dblp,expertise
0,aaron elmore,united states,university of chicago,https://dblp.org/pers/e/Elmore:Aaron_J=.html,
1,abdalghani abujabal,germany,amazon alexa,https://dblp.org/pers/a/Abujabal:Abdalghani.html,
2,abdul quamar,united states,ibm research almaden,https://dblp.org/pers/q/Quamar:Abdul.html,
3,abdulhakim qahtan,netherlands,utrecht university,https://dblp.org/pid/121/4198.html,
4,abhijnan chakraborty,germany,max planck institute for software systems,https://dblp.org/pers/c/Chakraborty:Abhijnan.html,


As we can see, the data contains the names of the data scientists, their countries and institutions. Also, the data contains the link to the DBLP page of the data scientist. We will later use this link to collect the data for each data scientist. But let's first perform some simple data analysis on the initial data.

In [72]:
# Display the number of unique values for each column
print('Unique values for each column:')
for column in initial_df.columns:
    print(f'The column "{column}" has {initial_df[column].nunique()} unique values')
    
# Display the number of duplicated names
print(f'\nThere are {initial_df["name"].duplicated().sum()} duplicated names')

# Print 3 examples of row pairs with duplicated names in a string format
print('\nExamples of duplicated names:')
# Get a random sample of 3 duplicated names
for name in np.random.choice(initial_df['name'][initial_df['name'].duplicated()].unique(), 3):
    print(f'Name: {name}')
    # Print the rows with the duplicated name
    for index in initial_df[initial_df['name'] == name].index:
        print(f'-Country: {initial_df.loc[index, "country"]}, Institution: {initial_df.loc[index, "institution"]}, Link: {initial_df.loc[index, "dblp"]}')

# Display the number of missing values for each column
print('\nMissing values for each column:')
for column in initial_df.columns:
    print(f'The column "{column}" has {initial_df[column].isnull().sum()} missing values')
    

Unique values for each column:
The column "name" has 1072 unique values
The column "country" has 44 unique values
The column "institution" has 704 unique values
The column "dblp" has 1079 unique values
The column "expertise" has 0 unique values

There are 148 duplicated names

Examples of duplicated names:
Name: tingjian ge
-Country: united states, Institution: university of massachusetts at lowell, Link: https://dblp.uni-trier.de/pers/g/Ge:Tingjian.html
-Country: united states, Institution: university of massachusetts, lowell, Link: https://dblp.uni-trier.de/pers/g/Ge:Tingjian.html
Name: sheng wang
-Country: china, Institution: alibaba group, Link: https://dblp.org/pid/85/1868-11.html
-Country: united states, Institution: new york university, Link: https://dblp.org/pid/85/1868.html
-Country: china, Institution: wuhan university, Link: https://dblp.org/pid/85/1868-7.html
Name: byron choi
-Country: hong kong sar, Institution: hong kong baptist university, Link: https://dblp.uni-trier.de

Looking at this simple analysis, we can see that the data is fairly clean, with only 3 missing values in the institution column. We can also observe several duplicated names. The sampled examples show two distinct cases. In the first, the rows describe the same data scientist with different institutions or countries (for example, "tingjian ge"). This is expected, as a data scientist can be affiliated with several institutions, and we can tell it is the same person because the rows share the same DBLP link. In the second, the rows describe different people who happen to share a name (for example, "sheng wang"), and each row has its own DBLP link. We can also see that the expertise column is empty (it was provided empty in the assignment).
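
Duplicated names can mean either one scientist listed several times or several scientists sharing a name, and counting the distinct DBLP links per name tells the two apart. The following is a minimal sketch on a hand-made sample that mirrors the rows printed above; the `sample` DataFrame and the `links_per_name` helper are illustrative names, not part of the project code.

```python
import pandas as pd

# Hand-made sample mirroring the rows printed above (illustrative only)
sample = pd.DataFrame({
    'name': ['tingjian ge', 'tingjian ge', 'sheng wang', 'sheng wang', 'sheng wang'],
    'dblp': [
        'https://dblp.uni-trier.de/pers/g/Ge:Tingjian.html',
        'https://dblp.uni-trier.de/pers/g/Ge:Tingjian.html',
        'https://dblp.org/pid/85/1868-11.html',
        'https://dblp.org/pid/85/1868.html',
        'https://dblp.org/pid/85/1868-7.html',
    ],
})

def links_per_name(df):
    """Return the number of distinct DBLP links for every name that occurs more than once."""
    repeated = df[df['name'].duplicated(keep=False)]
    return repeated.groupby('name')['dblp'].nunique()

counts = links_per_name(sample)
same_person = counts[counts == 1].index.tolist()  # one link: same scientist listed twice
homonyms = counts[counts > 1].index.tolist()      # several links: different people
print('Same person:', same_person)
print('Homonyms:', homonyms)
```

Names with a single distinct link can be merged safely, while names with several links have to be kept apart.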

### 1.2 PIDs and Final URLs Collection
Now, let's collect the PIDs and the Final Links (there may be some redirection using the links from the initial data) of the data scientists from the given data.
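
The PID extraction on its own can be sketched as follows. This is a minimal sketch, assuming a slightly stricter pattern that is anchored at the end of the URL; `extract_pid` and `PID_PATTERN` are hypothetical names introduced for illustration.

```python
import re

# Matches the part between 'pid/' and a trailing '.html'
PID_PATTERN = re.compile(r'pid/(.+)\.html$')

def extract_pid(url):
    """Return the PID with '/' replaced by '-', or None if the URL contains no PID."""
    match = PID_PATTERN.search(url)
    if match is None:
        return None
    return match.group(1).replace('/', '-')

print(extract_pid('https://dblp.org/pid/85/1868-11.html'))      # 85-1868-11
print(extract_pid('https://dblp.org/pers/g/Ge:Tingjian.html'))  # None
```

Links in the older `pers/...` form carry no PID, which is why the final URL after redirection is inspected rather than the original link.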

In [73]:
# Define the lists to hold the PIDs and the final URLs
pids = []
final_urls = []

# Define a variable to monitor for the links that caused errors
errors_links = []

# Check if we have already collected the PIDs and the final URLs
if 'scientists_pids_urls.csv' in os.listdir():
    print('The PIDs and the final URLs have already been collected')
else:
    # Iterate over the 'dblp' column in the initial DataFrame
    for link in tqdm(initial_df['dblp']):
        # Define an infinite loop to handle the Too Many Requests error
        response = None
        while True:
            # Try sending a GET request to the link
            response = requests.get(link)
    
            # If the status code is 429 (Too Many Requests), wait for 60 seconds before trying again
            if response.status_code == 429:
                print('Too many requests, sleeping for 60 seconds...')
                time.sleep(60)
            else:
                break
        
        # If the status code is not 200 (some Error in fetching the data), append the link to the errors list
        if response.status_code != 200:
            errors_links.append(link)
            pids.append('Error')
            final_urls.append('Error')
            continue
            
        # Get the final URL (after possible redirections)
        final_url = response.url
        # Define a regular expression pattern to extract the PID from the URL
        pattern = r'pid/(.*)\.html'
        # Find the PID in the URL
        match = re.search(pattern, final_url)
    
        # If the PID is found in the URL
        if match:
            # Extract the PID
            pid = match.group(1)
            # Replace '/' with '-' for better naming
            pid = pid.replace('/', '-')
            # Append the PID to the list with PIDs
            pids.append(pid)
            # Append the final URL to the list with final URLs
            final_urls.append(final_url)
        else:
            # Append an 'Error' to the list with PIDs to indicate that the PID was not found
            pids.append('Error')
            # Append an 'Error' to the list with final URLs to indicate that the final URL was not found
            final_urls.append('Error')
            # Append the link to the list with error links
            errors_links.append(link)

    # Check if there are any errors
    if len(errors_links) == 0:
        print('No errors occurred')
    else:
        print(f'The following links ({len(errors_links)}) caused errors:')
        for el in errors_links:
            print(el)

100%|██████████| 1220/1220 [24:04<00:00,  1.18s/it]

The following links (22) caused errors:
https://dblp.org/pid/39/1380.html
https://dblp.uni-trier.de/pers/c/Chakraborty:Anirban.html
https://dblp.org/pid/92/2769.html
https://dblp.org/pid/148/7268.html
https://dblp.uni-trier.de/pers/b/Barbosa:Denilson.html
https://dblp.org/pers/g/Georgakopoulos:Dimitrios.html
https://dblp.org/pid/161/0102.html
https://dblp.org/pers/m/Mansour:Essam.html
https://dblp.org/pers/m/Mansour:Essam.html
https://dblp.org/pid/284/0968.html
https://dblp.org/pid/98/5721.html
https://dblp.org/pers/j/Jung:Hyungsoo.html
https://dblp.uni-trier.de/pers/p/Petrov:Ilia.html
https://dblp.uni-trier.de/pers/p/Petrov:Ilia.html
https://dblp.org/pers/h/Hui:Kai.html
https://dblp.org/pers/w/Weidlich:Matthias.html
https://dblp.uni-trier.de/pers/hd/n/Nikolic:Milos
https://dblp.org/pers/z/Zhang:Ruqing.html
https://dblp.dagstuhl.de/pid/y/TingYu.html
https://dblp.dagstuhl.de/pid/y/TingYu.html
https://dblp.uni-trier.de/pers/t/Tao:Yufei.html
https://dblp.uni-trier.de/pers/t/Tao:Yufei.html




As we can see, we encountered some errors while fetching the PIDs and the final URLs. Let's look at the heading (h1) of the pages to which the links lead to understand the errors.

In [74]:
# Define a list to hold the headings of the pages
errors_links_headings = []

# Check if the cleaning of the errors has already been done
if 'scientists_pids_urls.csv' in os.listdir():
    print('The cleaning of the errors has already been done')
else:
    for el in errors_links:
        # Send a GET request to the error link
        response = requests.get(el)
        # Parse the HTML content
        soup = BeautifulSoup(response.content, 'html.parser')
        # Find the h1 tag
        h1 = soup.find('h1')
        # Append the heading to the list
        errors_links_headings.append(h1.text if h1 else 'No h1 tag found')
        
    # Display the unique headings of the pages
    print('Unique headings of the pages:')
    for heading in set(errors_links_headings):
        print('-' + heading)

Unique headings of the pages:
-Error 410: Gone
-Error 404: Not Found


The list of unique headings of the pages shows that the pages are not found (404 Not Found) or are gone (410 Gone). This is a common issue when fetching data from the web as the links may be outdated or the pages may have been removed. We will not be able to collect the data for these data scientists. Thus, we will remove these data scientists from the initial data.
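
The failures seen so far fall into two groups: a 429 response is transient and worth retrying, while 404 and 410 are permanent. A minimal sketch of a bounded retry helper that makes this distinction explicit is shown below. The names `fetch_with_retry` and `is_permanent_failure` and the default of 5 retries are assumptions for illustration; the project code itself retries on 429 without an upper limit.

```python
import time

RETRYABLE = {429}        # Too Many Requests: waiting and retrying may help
PERMANENT = {404, 410}   # Not Found / Gone: retrying will not help

def is_permanent_failure(status_code):
    """Return True for status codes that indicate the page is not coming back."""
    return status_code in PERMANENT

def fetch_with_retry(fetch, url, max_retries=5, wait_seconds=60, sleep=time.sleep):
    """Call fetch(url) until the status is not retryable or the retries run out.

    `fetch` is any callable that returns an object with a `status_code`
    attribute, for example `requests.get`.
    """
    response = fetch(url)
    for _ in range(max_retries):
        if response.status_code not in RETRYABLE:
            break
        sleep(wait_seconds)
        response = fetch(url)
    return response
```

Passing `requests.get` as `fetch` reproduces the behaviour of the collection loop, except that a link that keeps answering with 429 is eventually given up on instead of blocking the run.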

In [75]:
# Check if the cleaning of the errors has already been done
if 'scientists_pids_urls.csv' in os.listdir():
    print('The cleaning of the errors has already been done')
else:
    # Create a new DataFrame by copying the initial DataFrame and adding the PIDs and final URLs
    pids_urls_df = initial_df.copy()
    pids_urls_df['pid'] = pids
    pids_urls_df['final_url'] = final_urls
    
    # Remove the rows where the PID is 'Error' OR the final URL is 'Error'
    pids_urls_df = pids_urls_df[(pids_urls_df['pid'] != 'Error') & (pids_urls_df['final_url'] != 'Error')]
    
    # Display the number of rows removed
    print(f'{initial_df.shape[0] - pids_urls_df.shape[0]} rows were removed because their links were broken.\n')
    
    # Drop the 'expertise' column as it is initially empty (will be filled later)
    pids_urls_df = pids_urls_df.drop(columns='expertise')
    
    # Save the DataFrame to a CSV file
    pids_urls_df.to_csv('scientists_pids_urls.csv', index=False)
    
    # Display the difference between the dimensions of the initial and the new DataFrame (from the CSV files)
    print('Dimensions Comparison:')
    print(f'New DataFrame: {pd.read_csv("scientists_pids_urls.csv").shape[0]} x {pd.read_csv("scientists_pids_urls.csv").shape[1]}')
    print(f'Initial DataFrame: {pd.read_csv("scientists_collab.csv").shape[0]} x {pd.read_csv("scientists_collab.csv").shape[1]}')

22 rows were removed because their links were broken.

Dimensions Comparison:
New DataFrame: 1198 x 6
Initial DataFrame: 1220 x 5


From now on, we will be using the newly created DataFrame with the PIDs and the final URLs (scientists_pids_urls.csv), from which the broken links have just been removed.

### 1.3 Data Cleaning
Now, we will explore the new cleaned data as there may still be some other issues with it.

In [76]:
# Check the number of duplicated PIDs
pids_urls_df = pd.read_csv('scientists_pids_urls.csv')
print(f'The number of duplicated PIDs is {pids_urls_df[pids_urls_df.duplicated(subset="pid", keep=False)].shape[0]}')

The number of duplicated PIDs is 278


As we can see, the data still contains rows with duplicated PIDs. We will now merge the rows that share a PID and keep the unique values from them (country, institution, etc.).

In [93]:
# Group the duplicated PIDs rows by the PID and aggregate the values
duplicated_pids = pids_urls_df[pids_urls_df.duplicated(subset='pid', keep=False)].groupby('pid').agg(lambda x: set(x))
# If the value for the column is a set of length = 1, get the first element of the set
for column in duplicated_pids.columns:
    duplicated_pids[column] = duplicated_pids[column].apply(lambda x: x.pop() if len(x) == 1 else x)

# Display the number of rows with column values as sets for each column
print('Number of rows with column values as sets for each column:')
for column in duplicated_pids.columns:
    print(f'-The column "{column}" has {duplicated_pids[duplicated_pids[column].apply(lambda x: isinstance(x, set))].shape[0]} rows with values as sets')
    
# Print the set values for the 'dblp' and 'final_url' columns
print('\nThe rows with set values in the "dblp" and "final_url" columns:')
for index, row in duplicated_pids.iterrows():
    # Check if the 'dblp' or 'final_url' columns have values as sets
    if isinstance(row['dblp'], set) or isinstance(row['final_url'], set):
        print(f'PID: {index}')
        if isinstance(row['dblp'], set):
            print(f'  -The "dblp" column has set of URLs: {row["dblp"]}')
        if isinstance(row['final_url'], set):
            print(f'  -The "final_url" column has set of URLs: {row["final_url"]}')

Number of rows with column values as sets for each column:
-The column "name" has 10 rows with values as sets
-The column "country" has 21 rows with values as sets
-The column "institution" has 95 rows with values as sets
-The column "dblp" has 9 rows with values as sets
-The column "final_url" has 5 rows with values as sets

The rows with set values in the "dblp" and "final_url" columns:
PID: 33-6315-1
  -The "dblp" column has set of URLs: {'https://dblp.uni-trier.de/pid/33/6315-1.html', 'https://dblp.org/pid/33/6315-1.html'}
  -The "final_url" column has set of URLs: {'https://dblp.uni-trier.de/pid/33/6315-1.html', 'https://dblp.org/pid/33/6315-1.html'}
PID: 49-8316
  -The "dblp" column has set of URLs: {'https://dblp.org/pers/k/Koutris:Paraschos.html', 'https://dblp.org/pid/49/8316.html'}
PID: c-VassilisChristophides
  -The "dblp" column has set of URLs: {'https://dblp.uni-trier.de/pid/c/VassilisChristophides.html', 'https://dblp.org/pid/c/VassilisChristophides.html'}
  -The "final_

As we can see, several rows have sets as values for some columns, meaning that they hold multiple values for the same column. This is easily explained for the 'country', 'institution', and 'name' columns: data scientists can be part of multiple institutions in different countries, and their names can be written in different ways. For the 'dblp' and 'final_url' columns, however, we should have only one value per row. After inspecting the provided links, we observed that they lead to pages with the same data. For the 'final_url' column, we will keep the link that starts with 'https://dblp.org/pid/' (to maintain consistency) and remove the rest. We will not do the same for the 'dblp' column: 'final_url' is used for the data collection, while 'dblp' can simply store the different links to the same page.
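
The selection rule can be sketched as a small function. This is a minimal sketch with a hypothetical helper name, `pick_canonical`; it sorts the candidates first so that the choice does not depend on the iteration order of the set.

```python
def pick_canonical(urls, prefix='https://dblp.org/pid/'):
    """Return the first URL (in sorted order) that starts with `prefix`, or None."""
    for url in sorted(urls):
        if url.startswith(prefix):
            return url
    return None

urls = {
    'https://dblp.uni-trier.de/pid/33/6315-1.html',
    'https://dblp.org/pid/33/6315-1.html',
}
print(pick_canonical(urls))  # https://dblp.org/pid/33/6315-1.html
```

A return value of `None` signals that no link with the preferred prefix exists, which corresponds to the "not found" message in the project code.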

In [97]:
# For every row where the 'final_url' column is a set, keep the link that starts with 'https://dblp.org/pid/' and remove the rest of the links
for index, row in duplicated_pids[duplicated_pids['final_url'].apply(lambda x: isinstance(x, set))].iterrows():
    # Define a variable to store if such a link is found
    found = False
    for link in row['final_url']:
        if link.startswith('https://dblp.org/pid/'):
            duplicated_pids.loc[index, 'final_url'] = link
            found = True
            break
    if not found:
        print(f'No link that starts with "https://dblp.org/pid/" was found for the PID: {index}')

# Print the number of rows with 'final_url' as a set
print(f'The number of rows with "final_url" as a set is {duplicated_pids[duplicated_pids["final_url"].apply(lambda x: isinstance(x, set))].shape[0]}')

The number of rows with "final_url" as a set is 0


After we have cleaned the duplicated data, we will append the cleaned data for the duplicated PIDs to the non-duplicated data.

In [107]:
# Define a DataFrame to hold the non-duplicated PIDs
non_duplicated_pids = pids_urls_df.drop_duplicates(subset='pid', keep=False)

# Reset the index of duplicated_pids DataFrame
duplicated_pids_reset = duplicated_pids.reset_index()

# Append the cleaned duplicated PIDs to the non-duplicated PIDs
cleaned_pids_urls_df = pd.concat([non_duplicated_pids, duplicated_pids_reset]).sort_values(by='pid').reset_index(drop=True)

# Display the number of rows removed
print(f'{pids_urls_df.shape[0] - cleaned_pids_urls_df.shape[0]} rows were removed because they were duplicates in terms of PIDs')

# Display if there are still duplicated PIDs in the DataFrame
print(f'The number of duplicated PIDs is {cleaned_pids_urls_df[cleaned_pids_urls_df.duplicated(subset="pid", keep=False)].shape[0]}')

# Save the DataFrame to a CSV file
cleaned_pids_urls_df.to_csv('scientists_clean.csv', index=False)

146 rows were removed because they were duplicates in terms of PIDs
The number of duplicated PIDs is 0


As we can see, we have successfully cleaned the data from the duplicated PIDs. Now, we will be using the cleaned data for the actual data collection from the DBLP pages.

### 1.4 Data Collection from DBLP
After a closer inspection of the DBLP pages, we have identified that the data for each scientist is stored in an XML format file that can be accessed by just replacing the '.html' extension with '.xml' for the 'final_url' link. Now, let's build the DataFrame with the links to the XML files for each scientist.

In [108]:
# Read the cleaned CSV file
cleaned_pids_urls_df = pd.read_csv('scientists_clean.csv')

# Creating a new column to hold the XML links
cleaned_pids_urls_df['xml'] = cleaned_pids_urls_df['final_url'].apply(lambda x: x.replace('.html', '.xml'))

# Check if all the XML links are ending with '.xml'
if cleaned_pids_urls_df['xml'].apply(lambda x: x.endswith('.xml')).all():
    print('All the XML links are ending with ".xml"')
    # Save the DataFrame to a CSV file
    cleaned_pids_urls_df.to_csv('scientists_xml.csv', index=False)
    print('The XML links have been added to the DataFrame and saved to a CSV file')
else:
    print('Not all the XML links are ending with ".xml"')

All the XML links are ending with ".xml"
The XML links have been added to the DataFrame and saved to a CSV file
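
Note that `str.replace` substitutes every occurrence of '.html' in a URL, not only the extension. A stricter variant that touches only a trailing '.html' could look like the sketch below; `to_xml_url` is a hypothetical helper name.

```python
def to_xml_url(html_url):
    """Swap a trailing '.html' for '.xml' and leave any other URL unchanged."""
    suffix = '.html'
    if html_url.endswith(suffix):
        return html_url[:-len(suffix)] + '.xml'
    return html_url

print(to_xml_url('https://dblp.org/pid/85/1868.html'))  # https://dblp.org/pid/85/1868.xml
```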


Now, we have the DataFrame with the XML links for each scientist. We will use these links to collect the data for each scientist.

In [110]:
# Get the DataFrame with the XML links from the CSV file
xml_df = pd.read_csv('scientists_xml.csv')

# Define a variable to store the error count that may occur during the data collection
error_cnt = 0

# Create the directory for the XML files if it does not exist yet
os.makedirs('xml_files', exist_ok=True)

# Iterate over the DataFrame rows
for index, row in tqdm(xml_df.iterrows(), total=xml_df.shape[0]):
    
    # Check if the file exists, skip the file if it exists
    if f'{row["pid"]}.xml' in os.listdir('xml_files'):
        print(f'The file {row["pid"]}.xml already exists')
        continue
    
    while True:
        # Send a GET request
        response = requests.get(row['xml'])
        
        # If the status code is 429 (Too Many Requests), wait for 60 seconds before trying again
        if response.status_code == 429:
            print('Too many requests, sleeping for 60 seconds...')
            time.sleep(60)
        else:
            break
    
    # Print Error if the status code is not 200
    if response.status_code != 200:
        error_cnt += 1
        print(f'Error: {response.status_code}')
        print(f'Error for {row["pid"]}: {row["xml"]}')
        continue
    
    # Save the content to a file (create a new file for each PID)
    with open(f'xml_files/{row["pid"]}.xml', 'wb') as file:
        file.write(response.content)

# Print the number of errors
print(f'{error_cnt} errors were encountered')

100%|██████████| 1052/1052 [14:13<00:00,  1.23it/s]

0 errors were encountered





Now, we have collected the data for each scientist in an XML format file. We will have to explore the data to understand how to extract the necessary information for the data scientists.

In [124]:
def traverse_tree(element, tag_tree):
    """Recursive function to traverse the XML tree and build the hierarchy of tags."""
    for child in element:
        if child.tag not in tag_tree:
            tag_tree[child.tag] = {}
        traverse_tree(child, tag_tree[child.tag])

tag_tree = {}

# Iterate over all XML files in the directory
for filename in os.listdir('xml_files'):
    if filename.endswith('.xml'):
        # Parse the XML file
        tree = ET.parse(os.path.join('xml_files', filename))
        # Get the root element
        root = tree.getroot()
        # Traverse the XML tree and build the hierarchy of tags
        traverse_tree(root, tag_tree)

def build_tree(data, parent=None):
    if type(data) is dict:
        for key, value in data.items():
            node = Node(key)
            parent.add_node(node)
            build_tree(value, node)
    elif type(data) is list:
        for index, value in enumerate(data):
            node = Node(str(index))
            parent.add_node(node)
            build_tree(value, node)
    else:
        node = Node(str(data))
        parent.add_node(node)

# Round-trip the tag hierarchy through JSON, which yields a plain nested dict
data = json.loads(json.dumps(tag_tree, indent=4))

# Create the root node
root = Node("dblpperson")
# Build the tree
build_tree(data, root)
# Create a Tree instance and add the root node
tree = Tree()
tree.add_node(root)

# Display the tree
tree

Tree(nodes=(Node(name='dblpperson', nodes=(Node(name='person', nodes=(Node(name='author'), Node(name='note'), …
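
The interactive widget is cut off in a static export, so a plain-text dump of the tag hierarchy can be a useful complement. The following is a minimal sketch that runs on a tiny hand-written record in the shape of a DBLP person file; the sample XML, its attribute values, and the `format_tree` helper are assumptions for illustration, not real data.

```python
import xml.etree.ElementTree as ET

def traverse_tree(element, tag_tree):
    """Collect the nested tag names found below `element` into `tag_tree`."""
    for child in element:
        traverse_tree(child, tag_tree.setdefault(child.tag, {}))

def format_tree(tag_tree, indent=0):
    """Return the tag hierarchy as a list of indented lines."""
    lines = []
    for tag, children in tag_tree.items():
        lines.append('  ' * indent + tag)
        lines.extend(format_tree(children, indent + 1))
    return lines

# Tiny hand-written sample (not real data)
sample_xml = """
<dblpperson name="Jane Doe" pid="00/0000">
  <person><author pid="00/0000">Jane Doe</author></person>
  <r>
    <article key="journals/x/Doe20">
      <author pid="00/0000">Jane Doe</author>
      <title>A title.</title>
      <year>2020</year>
    </article>
  </r>
</dblpperson>
"""

tag_tree = {}
traverse_tree(ET.fromstring(sample_xml.strip()), tag_tree)
print('\n'.join(format_tree(tag_tree)))
```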

Exploring the XML tag hierarchy and referring to the https://dblp.org/xml/docu/dblpxml.pdf document (description of the DBLP XML format), we can define the tags that we will use to extract the necessary information for the data scientists. We will use the following tags:

In [127]:
# Define the tags that capture the publications
publication_tags = ['inproceedings', 'article', 'incollection', 'book', 'proceedings', 'phdthesis', 'data', 'www']

# Define the tags that capture the title of the publication
title_tags = ['title']

# Define the tags that capture the year of the publication
year_tags = ['year']

# Define the tags that capture the authors of the publication
author_tags = ['author', 'editor']

After we have collected and explored the data for each scientist, we can now proceed to the conversion of the XML files to DataFrames.

### 1.5 Data Conversion
First, we have to create a DataFrame for all the publications of the data scientists. These publications will represent the links between the data scientists. We will also use this DataFrame to store information about the publications (title, year, etc.).

In [204]:
# Initialize an empty list to store the data for each paper
papers = []

# Define the given authors
our_authors = pd.read_csv('scientists_clean.csv')['pid'].values

# Define a set to hold the external authors
external_authors = set()

# Define an error counter
error_cnt = 0

# Iterate over the XML files
for filename in tqdm(os.listdir('xml_files')):
    if filename.endswith('.xml'):
        # Parse the XML file
        tree = ET.parse(os.path.join('xml_files', filename))
        # Get the root element
        root = tree.getroot()
        # Iterate over the <r> elements
        for r in root.findall('r'):
            # Define a variable to store if a publication tag is found
            found = False
            # Get the publication element for every possible publication tag
            for pt in publication_tags:
                publication = r.find(pt)
                if publication is not None:
                    found = True
                    # Extract the title, year, key, and PIDs of all authors
                    title_element = publication.find('title')
                    title = title_element.text if title_element is not None else None
                    # Fall back to the raw XML when the title starts with nested markup (e.g. <i>), where .text is None
                    if title is None and title_element is not None:
                        title = ET.tostring(title_element)
                    key = publication.get('key')
                    year = publication.find('year').text if publication.find('year') is not None else None
                    authors = []
                    e_authors = []
                    for a_tag in author_tags:
                        # Add the authors for the current publication to the list
                        authors.extend([author.get('pid').replace('/', '-') for author in publication.findall(a_tag) if author.get('pid').replace('/', '-') in our_authors])
                        # Add the external authors for the current publication to the list
                        e_authors.extend([author.get('pid').replace('/', '-') for author in publication.findall(a_tag) if author.get('pid').replace('/', '-') not in our_authors])
                        # Add the external authors for the current publication to the set
                        external_authors.update(e_authors)
                    # Keep the publication only if the title, year, and key are present and at least one given author is listed
                    if title is not None and year is not None and key is not None and len(authors) > 0:
                        # Append the extracted data to the list
                        papers.append([title, year, key, authors, e_authors, pt, filename])
                    else:
                        print(f'Error: {filename}, {key} --> title: {title}, year: {year}, key: {key}, authors: {authors}')
                        error_cnt += 1
            if not found:
                print(f'Error: {filename}, {r}')
                error_cnt += 1
                    
# Display if there are any errors
if error_cnt == 0:
    print('No errors occurred')
else:
    print(f'{error_cnt} errors occurred')

# Convert the list to a pandas DataFrame
papers_df = pd.DataFrame(papers, columns=['Title', 'Year', 'Key', 'Authors', 'External Authors', 'Publication Type' ,'file'])
# Convert the lists in the 'Authors' and 'External Authors' columns to strings (lists are not hashable and cannot be used as group keys)
papers_df['Authors'] = papers_df['Authors'].apply(str)
papers_df['External Authors'] = papers_df['External Authors'].apply(str)
# Group by all columns except 'file' and aggregate the values
papers_df = papers_df.groupby(['Title', 'Year', 'Key', 'Authors', 'External Authors', 'Publication Type']).agg(lambda x: set(x)).reset_index()
# Convert the 'file' set to string
papers_df['file'] = papers_df['file'].apply(str)
# Remove the duplicates
papers_df = papers_df.drop_duplicates()
# Save the DataFrame to a CSV file
papers_df.to_csv('papers.csv', index=False)

# Convert the set to a DataFrame
external_authors_df = pd.DataFrame(list(external_authors), columns=['pid'])
# Save the DataFrame to a CSV file
external_authors_df.to_csv('external_authors.csv', index=False)

100%|██████████| 1052/1052 [00:16<00:00, 64.56it/s]


No errors occurred
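
One caveat about the title extraction: in ElementTree, `.text` holds only the text that precedes the first child element. A title that contains inline markup such as `<i>` therefore comes back truncated, or as `None` when the markup is at the very start. A minimal sketch of an alternative based on `itertext()` is shown below; `full_title` is a hypothetical helper name and the sample element is hand-written.

```python
import xml.etree.ElementTree as ET

def full_title(publication):
    """Return the title including text inside nested tags, or None if there is no <title>."""
    title_element = publication.find('title')
    if title_element is None:
        return None
    return ''.join(title_element.itertext())

pub = ET.fromstring(
    '<incollection key="series/ais/CirianiVFS07a">'
    '<title><i>k</i>-Anonymity.</title>'
    '</incollection>'
)
print(pub.find('title').text)  # None: there is no text before the <i> child
print(full_title(pub))         # k-Anonymity.
```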


We have successfully converted the XML files to a DataFrame. We have the information about the publications of the data scientists: title, year, key, authors, and external authors. We have also created a DataFrame for the external authors (not given in the initial data). We can now proceed to the data analysis.
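
Because the 'Authors' column is saved as a stringified list, turning publications into collaboration links requires parsing it back first. The following is a minimal sketch, assuming that every pair of given authors on a publication forms one link; `collaboration_edges` is a hypothetical helper and the sample values are taken from rows of the papers table.

```python
import ast
from collections import Counter
from itertools import combinations

def collaboration_edges(author_lists):
    """Count, for every unordered pair of PIDs, the publications they share."""
    edges = Counter()
    for raw in author_lists:
        # Each entry is a stringified list such as "['a', 'b']"
        authors = sorted(set(ast.literal_eval(raw)))
        edges.update(combinations(authors, 2))
    return edges

authors_column = [
    "['w-RaymondChiWingWong', 'p-JianPei']",
    "['w-RaymondChiWingWong', 'w-KeWang']",
    "['s-PSamarati']",
]
edges = collaboration_edges(authors_column)
for pair, weight in edges.items():
    print(pair, weight)
```

Sorting the PIDs before pairing ensures that the same two authors always produce the same key, and single-author publications contribute no link.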

### 1.6 Data Analysis & Cleaning
In this section, we will perform simple data analysis on the collected data.

In [239]:
# Read the CSV files into DataFrames
papers_df = pd.read_csv('papers.csv')

# Show the size of the DataFrame
print(f'The DataFrame has {papers_df.shape[0]} rows and {papers_df.shape[1]} columns')

# Print the number of duplicated values
print(f'\n The number of duplicated values is {papers_df.duplicated().sum()}')

# Display the number of duplicated values for each column
print('\nNumber of duplicated values for each column:')
for column in ['Title', 'Key']:
    print(f'-The column "{column}" has {papers_df.duplicated(subset=column, keep=False).sum()} duplicated values')

# Show some rows from the DataFrame with duplicated titles
duplicated_titles = papers_df[papers_df.duplicated(subset='Title', keep=False)]
duplicated_titles

The DataFrame has 102549 rows and 7 columns

 The number of duplicated values is 0

Number of duplicated values for each column:
-The column "Title" has 15431 duplicated values
-The column "Key" has 2 duplicated values


Unnamed: 0,Title,Year,Key,Authors,External Authors,Publication Type,file
66,b'<title><i>k</i>-Anonymity.</title>\n',2007,series/ais/CirianiVFS07a,['s-PSamarati'],"['c-ValentinaCiriani', 'v-SabrinaDeCapitanidiV...",incollection,{'s-PSamarati.xml'}
67,b'<title><i>k</i>-Anonymity.</title>\n',2018,reference/db/Domingo-Ferrer18d,['d-JDomingoFerrer'],[],incollection,{'d-JDomingoFerrer.xml'}
81,"""A Virus Has No Religion"": Analyzing Islamopho...",2021,conf/ht/ChandraRSGBK21,['97-5147'],"['152-3504', '283-5340', '283-5610', '06-5843-...",inproceedings,{'97-5147.xml'}
82,"""A Virus Has No Religion"": Analyzing Islamopho...",2021,journals/corr/abs-2107-05104,['97-5147'],"['152-3504', '283-5340', '283-5610', '06-5843-...",article,{'97-5147.xml'}
95,"""Diversity and Uncertainty in Moderation"" are ...",2022,conf/naacl/KumarDC22,['29-5841'],"['234-8570', '71-4474']",inproceedings,{'29-5841.xml'}
...,...,...,...,...,...,...,...
102464,Éditorial.,2008,journals/isi/BoucelmaHLP08,['p-JMPetit'],"['b-OmarBoucelma', 'h-MohandSaidHacid', 'l-The...",article,{'p-JMPetit.xml'}
102465,Éditorial.,2009,journals/isi/ServigneZ09,['z-KarineZeitouni'],['49-5791'],article,{'z-KarineZeitouni.xml'}
102466,Éditorial.,2012,journals/isi/Bouganim12,['b-LucBouganim'],[],article,{'b-LucBouganim.xml'}
102467,Éditorial.,2014,journals/isi/Petit14,['p-JMPetit'],[],article,{'p-JMPetit.xml'}


As we can see, we have some duplicated values again. This time, the duplicates are caused by the same publication being recorded under several venues (for example, a conference paper and its preprint under journals/corr). As a result, we have publications with the same title, year, and authors but different keys. We will remove these duplicates based on the 'Title' and 'Authors' columns.
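
The effect of deduplicating on 'Title' and 'Authors' can be seen on a small sample. This is a minimal sketch with hand-made rows based on the table above; the first title is abbreviated and the `sample` DataFrame is illustrative only.

```python
import pandas as pd

# Hand-made rows based on the table above (first title abbreviated)
sample = pd.DataFrame({
    'Title': ['"A Virus Has No Religion"', '"A Virus Has No Religion"', 'Éditorial.', 'Éditorial.'],
    'Key': ['conf/ht/ChandraRSGBK21', 'journals/corr/abs-2107-05104',
            'journals/isi/Bouganim12', 'journals/isi/Petit14'],
    'Authors': ["['97-5147']", "['97-5147']", "['b-LucBouganim']", "['p-JMPetit']"],
})

# drop_duplicates keeps the first row of every (Title, Authors) combination
deduplicated = sample.drop_duplicates(subset=['Title', 'Authors'])
print(deduplicated['Key'].tolist())
```

The two records of the same paper collapse into one, while the two editorials survive because their authors differ. This is why some duplicated titles can legitimately remain after this step.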

In [240]:
# Remove the duplicated values based on the 'Title' and 'Authors' columns
papers_df = papers_df.drop_duplicates(subset=['Title', 'Authors'])

# Show the size of the DataFrame
print(f'The DataFrame has {papers_df.shape[0]} rows and {papers_df.shape[1]} columns')

# Display the number of duplicated values for each column
print('\nNumber of duplicated values for each column:')
for column in ['Title', 'Key']:
    print(f'-The column "{column}" has {papers_df.duplicated(subset=column, keep=False).sum()} duplicated values')

# Show some rows from the DataFrame with duplicated titles
duplicated_titles = papers_df[papers_df.duplicated(subset='Title', keep=False)]
duplicated_titles

The DataFrame has 94942 rows and 7 columns

Number of duplicated values for each column:
-The column "Title" has 605 duplicated values
-The column "Key" has 0 duplicated values


Unnamed: 0,Title,Year,Key,Authors,External Authors,Publication Type,file
66,b'<title><i>k</i>-Anonymity.</title>\n',2007,series/ais/CirianiVFS07a,['s-PSamarati'],"['c-ValentinaCiriani', 'v-SabrinaDeCapitanidiV...",incollection,{'s-PSamarati.xml'}
67,b'<title><i>k</i>-Anonymity.</title>\n',2018,reference/db/Domingo-Ferrer18d,['d-JDomingoFerrer'],[],incollection,{'d-JDomingoFerrer.xml'}
227,(,2007,conf/waim/WongLYHFP07,"['w-RaymondChiWingWong', 'p-JianPei']","['30-2791', '95-578-1', '13-4665', 'f-AdaWaiCh...",inproceedings,"{'p-JianPei.xml', 'w-RaymondChiWingWong.xml'}"
228,(,2009,journals/jiis/WongLFW09,"['w-RaymondChiWingWong', 'w-KeWang']","['20-1583', 'f-AdaWaiCheeFu']",article,"{'w-KeWang.xml', 'w-RaymondChiWingWong.xml'}"
300,10,2007,conf/icde/AntovaKO07,"['a-LyublenaAntova', 'k-ChristophKoch']",['o-DanOlteanu'],inproceedings,"{'k-ChristophKoch.xml', 'a-LyublenaAntova.xml'}"
...,...,...,...,...,...,...,...
102463,Éditorial.,2006,journals/isi/Mothe06,['m-JosianeMothe'],[],article,{'m-JosianeMothe.xml'}
102464,Éditorial.,2008,journals/isi/BoucelmaHLP08,['p-JMPetit'],"['b-OmarBoucelma', 'h-MohandSaidHacid', 'l-The...",article,{'p-JMPetit.xml'}
102465,Éditorial.,2009,journals/isi/ServigneZ09,['z-KarineZeitouni'],['49-5791'],article,{'z-KarineZeitouni.xml'}
102466,Éditorial.,2012,journals/isi/Bouganim12,['b-LucBouganim'],[],article,{'b-LucBouganim.xml'}
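
The counts printed above come from `duplicated(..., keep=False)`, which counts every row that belongs to a duplicate group, not just the surplus copies. The following is a minimal sketch on toy data (the titles and author IDs are hypothetical, not taken from the dataset) that shows the difference and what `drop_duplicates(subset=['Title', 'Authors'])` leaves behind.

```python
import pandas as pd

# Toy data with hypothetical titles and author IDs
toy = pd.DataFrame({
    'Title': ['Paper A', 'Paper A', 'Paper A', 'Paper B', 'Paper B'],
    'Authors': ["['x']", "['x']", "['y']", "['z']", "['z']"],
})

# keep=False flags every member of a duplicate group
rows_in_dup_groups = toy.duplicated(subset='Title', keep=False).sum()

# The default keep='first' flags only the extra copies
extra_copies = toy.duplicated(subset='Title').sum()

# Keep the first row of each (Title, Authors) pair
deduped = toy.drop_duplicates(subset=['Title', 'Authors'])

# Rows that still share a title after deduplication (same title, different authors)
remaining = deduped.duplicated(subset='Title', keep=False).sum()

print(rows_in_dup_groups, extra_copies, len(deduped), remaining)
```

In this toy case, two "Paper A" rows survive because their author lists differ, which is the same reason some duplicated titles remain in `papers_df` after this step.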


After removing the duplicated publications, we can see that there are some very strange publications with only a few characters as titles (for example, `(` or `10`). They are not useful for the analysis and may be errors from the data collection, so we will remove all publications with fewer than 2 words in the title.

We can also see that there are many publications with only 1 author. These are not collaborations and cannot be used for the network analysis, so we will remove them as well.

In [241]:
# Remove the publications with fewer than 2 words in the title
papers_df = papers_df[papers_df['Title'].apply(lambda x: len(x.split()) >= 2)]

# Remove the publications with fewer than 2 authors in total (authors plus external authors)
papers_df = papers_df[papers_df.apply(lambda r: len(ast.literal_eval(r['Authors'])) + len(ast.literal_eval(r['External Authors'])) >= 2, axis=1)]

# Show the size of the DataFrame
print(f'The DataFrame has {papers_df.shape[0]} rows and {papers_df.shape[1]} columns')

# Display the number of duplicated values for each column
print('\nNumber of duplicated values for each column:')
for column in ['Title', 'Key']:
    print(f'-The column "{column}" has {papers_df.duplicated(subset=column, keep=False).sum()} duplicated values')

# Show some rows from the DataFrame with duplicated titles
duplicated_titles = papers_df[papers_df.duplicated(subset='Title', keep=False)]
duplicated_titles

The DataFrame has 91001 rows and 7 columns

Number of duplicated values for each column:
-The column "Title" has 270 duplicated values
-The column "Key" has 0 duplicated values


Unnamed: 0,Title,Year,Key,Authors,External Authors,Publication Type,file
1373,A Decentralized Approach for Controlled Sharin...,2003,conf/dbsec/BertinoFS03,"['b-ElisaBertino', 'f-ElenaFerrari', 's-AnnaCi...",[],inproceedings,"{'f-ElenaFerrari.xml', 'b-ElisaBertino.xml', '..."
1374,A Decentralized Approach for Controlled Sharin...,2006,conf/cscwd/BertinoS06,"['b-ElisaBertino', 's-AnnaCinziaSquicciarini']",[],inproceedings,"{'b-ElisaBertino.xml', 's-AnnaCinziaSquicciari..."
3826,A Query Language for XML.,1998,conf/w3c/FernandezS98,['s-DanSuciu'],['f-MaryFFernandez'],inproceedings,{'s-DanSuciu.xml'}
3827,A Query Language for XML.,1999,journals/cn/DeutschFFLS99,"['d-AlinDeutsch', 'h-AlonYHalevy', 's-DanSuciu']","['f-MaryFFernandez', '70-3007']",article,"{'h-AlonYHalevy.xml', 's-DanSuciu.xml', 'd-Ali..."
3859,A Queueing-Theoretic Framework for Vehicle Dis...,2019,conf/icde/0003FCW19,"['76-185-3', 'c-LeiChen0002']","['97-164', '181-2834']",inproceedings,"{'c-LeiChen0002.xml', '76-185-3.xml'}"
...,...,...,...,...,...,...,...
100122,Web Services.,2004,journals/insk/KossmannL04,['k-DonaldKossmann'],['l-FrankLeymann'],article,{'k-DonaldKossmann.xml'}
100913,WiSer: A Highly Available HTAP DBMS for IoT Ap...,2019,conf/bigdataconf/BarberSTTWGGL0M19,"['27-3375', '86-9726', '07-11533', 'm-CMohan',...","['33-3527', '16-5412', '96-122', '192-4573', '...",inproceedings,"{'27-3375.xml', 'm-ReneMuller.xml', '07-11533...."
100914,WiSer: A Highly Available HTAP DBMS for IoT Ap...,2019,journals/corr/abs-1908-01908,"['m-CMohan', 'm-ReneMuller', 'r-VijayshankarRa...","['33-3527', '96-122', '192-4573', '21-2457', '...",article,"{'27-3375.xml', 'm-ReneMuller.xml', '07-11533...."
101203,Workshop Organizers' Message.,2009,conf/dasfaa/SadiqDZYADLX09,"['z-XiaofangZhou', 'a-WGAref']","['s-SWSadiq', '97-1199', '86-2859-1', 'd-AlexD...",inproceedings,"{'a-WGAref.xml', 'z-XiaofangZhou.xml'}"
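
To make the two filtering rules above concrete, here is a minimal sketch on toy rows. The author IDs are hypothetical, and the helper name `n_authors` is introduced only for illustration. It assumes, as in the notebook, that the author columns hold stringified lists.

```python
import ast
import pandas as pd

# Toy rows; author IDs are hypothetical
toy = pd.DataFrame({
    'Title': ['(', 'Éditorial.', 'A Query Language for XML.', 'Web Services.'],
    'Authors': ["['a-1']", "['a-2']", "['a-3']", "['a-4']"],
    'External Authors': ["['e-1', 'e-2']", "[]", "['e-3']", "[]"],
})

def n_authors(row):
    """Total number of authors (listed plus external) on a publication."""
    return len(ast.literal_eval(row['Authors'])) + len(ast.literal_eval(row['External Authors']))

# Rule 1: the title must contain at least 2 whitespace-separated words
has_two_words = toy['Title'].str.split().str.len() >= 2

# Rule 2: the publication must have at least 2 authors in total
is_collab = toy.apply(n_authors, axis=1) >= 2

filtered = toy[has_two_words & is_collab]
print(filtered['Title'].tolist())
```

Only the third toy row passes both rules: the first two have one-word titles, and the last has a single author.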


Still, we have some more duplicated values. After further inspection, we can see that they are again caused by the same publication being assigned to multiple conferences, journals, etc. with different keys. We can also see that the author lists of the duplicated publications are subsets of each other: for example, a publication with authors A, B, C is duplicated with authors A, B. We will combine the duplicated publications, keeping the earliest year and the union of their authors (that is, the superset of the individual author lists).

In [243]:
# Define a list to hold the combined publications to be appended to the DataFrame later
combined_pubs = []

# For each unique title in the duplicated titles, combine the duplicated publications' authors and keep the earliest year
for title in duplicated_titles['Title'].unique():
    # Get the duplicated publications with the same title
    duplicated_pubs = papers_df[papers_df['Title'] == title]
    # Get the earliest year
    earliest_year = duplicated_pubs['Year'].min()
    # Get the authors of the duplicated publications
    authors = set()
    e_authors = set()
    # Get the publication types of the duplicated publications
    p_types = set()
    # Keep the key of the first duplicate (an arbitrary but deterministic choice)
    key = duplicated_pubs['Key'].iloc[0]
    # Keep the file of the first duplicate as well
    file = duplicated_pubs['file'].iloc[0]
    for index, row in duplicated_pubs.iterrows():
        authors.update(ast.literal_eval(row['Authors']))
        e_authors.update(ast.literal_eval(row['External Authors']))
        p_types.add(row['Publication Type'])
    if len(p_types) > 1:
        p_types = str(p_types)
    else:
        p_types = p_types.pop()
    # Remove the duplicated publications
    papers_df = papers_df[~papers_df.index.isin(duplicated_pubs.index)]
    # Append the combined publication
    combined_pubs.append([title, earliest_year, key, authors, e_authors, p_types, file])

# Append the combined publications to the DataFrame
papers_df = pd.concat([papers_df, pd.DataFrame(combined_pubs, columns=papers_df.columns)]).reset_index(drop=True)

# Show the size of the DataFrame
print(f'The DataFrame has {papers_df.shape[0]} rows and {papers_df.shape[1]} columns')

# Display the number of duplicated values for each column
print('\nNumber of duplicated values for each column:')
for column in ['Title', 'Key']:
    print(f'-The column "{column}" has {papers_df.duplicated(subset=column, keep=False).sum()} duplicated values')

# Show some rows from the DataFrame with duplicated titles
duplicated_titles = papers_df[papers_df.duplicated(subset='Title', keep=False)]
duplicated_titles

The DataFrame has 90837 rows and 7 columns

Number of duplicated values for each column:
-The column "Title" has 0 duplicated values
-The column "Key" has 0 duplicated values


Unnamed: 0,Title,Year,Key,Authors,External Authors,Publication Type,file
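
The merge rule used above (earliest year, first key, union of authors) can also be sketched with a single `groupby` pass instead of filtering the DataFrame once per title. This is only an illustration on toy data with hypothetical values. Unlike the cell above, it stores the merged authors as sorted lists rather than sets, on the assumption that a uniform list type is more convenient downstream.

```python
import ast
import pandas as pd

# Toy data with hypothetical values; author columns are stringified lists
toy = pd.DataFrame({
    'Title': ['T1', 'T1', 'T2'],
    'Year': [2006, 2003, 2019],
    'Key': ['k1', 'k2', 'k3'],
    'Authors': ["['A', 'B']", "['A', 'B', 'C']", "['D']"],
    'External Authors': ["[]", "['E']", "['F']"],
})

def union_of_lists(cells):
    """Parse each stringified list and return the sorted union of all items."""
    merged = set()
    for cell in cells:
        merged.update(ast.literal_eval(cell))
    return sorted(merged)

rows = []
for title, group in toy.groupby('Title', sort=False):
    rows.append({
        'Title': title,
        'Year': group['Year'].min(),          # earliest year
        'Key': group['Key'].iloc[0],          # key of the first record
        'Authors': union_of_lists(group['Authors']),
        'External Authors': union_of_lists(group['External Authors']),
    })

merged = pd.DataFrame(rows)
print(merged)
```

Titles that occur only once pass through unchanged apart from the parsing, so the same loop handles both duplicated and unique publications.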


After all the cleaning, we finally have a clean DataFrame with no duplicated titles or keys. We will now proceed to save the DataFrame to a CSV file.

In [244]:
# Save the DataFrame to a CSV file
papers_df.to_csv('papers_clean.csv', index=False)
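
Note that after the merge step the author columns contain a mix of stringified lists (untouched rows) and Python sets (merged rows), and both are written to the CSV file as text. The sketch below, on toy data with hypothetical values, shows one way to read such a column back into a uniform type. The helper name `to_sorted_list` is introduced only for illustration.

```python
import ast
import io
import pandas as pd

# Toy data: a merged row holding a set, and an untouched row holding a stringified list
toy = pd.DataFrame({
    'Title': ['T1', 'T2'],
    'Authors': [{'A', 'B'}, "['C']"],
})

# Round-trip through CSV in memory (stands in for writing and reading a file)
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

def to_sorted_list(cell):
    """Parse a stringified list or set and return its items as a sorted list."""
    return sorted(ast.literal_eval(cell))

loaded['Authors'] = loaded['Authors'].apply(to_sorted_list)
print(loaded['Authors'].tolist())
```

One caveat: an empty set is written as `set()`, which `ast.literal_eval` accepts only from Python 3.9 onwards.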

Now, we will analyze the data in the cleaned DataFrame.

In [248]:
# Read the CSV file into a DataFrame
papers_df = pd.read_csv('papers_clean.csv')

# Display the size of the DataFrame
print(f'The DataFrame has {papers_df.shape[0]} rows and {papers_df.shape[1]} columns')

# Display the number of removed rows from the initial DataFrame
print(f'{pd.read_csv("papers.csv").shape[0] - papers_df.shape[0]} rows were removed from the initial DataFrame')

# Print the number of rows with duplicated titles
print(f'The number of rows with duplicated titles is {papers_df[papers_df.duplicated(subset="Title", keep=False)].shape[0]}')

The DataFrame has 90837 rows and 7 columns
11712 rows were removed from the initial DataFrame
The number of rows with duplicated titles is 0


We have successfully cleaned the data and removed the duplicated values. Now, we will proceed to the network construction and analysis.

## 2. Network Analysis