# CSV Dataset Analysis

- The objective of this file is to identify glaring problems in the CSV datasets pulled from the XML records of the NeurIPS, IJCAI, MLCH, ICML, and AAAI conferences from 2010-2020
- If possible, this file will also look into cleaning harmful/missing data and into supplementing the dataset with gender and author affiliation, truncating the dataset to pull relevant information and insights

In [4]:
import pandas as pd

## Neur-IPS

- Below is a preview of a few of the columns we'll work with from the Neur-IPS dataset:
    - first-author: The name of the first author of the paper
    - last-author: The name of the last author listed on the same paper
    - paper-url: The link to the paper
    - DBLP-url: The link to the DBLP page for the paper, from which we are pulling the data

In [16]:
data_Neur_IPS = pd.read_csv(r"Output-CSVs\NeurIPS-XML-output.csv")
print(data_Neur_IPS)

      Unnamed: 0  year      first-author          last-author  \
0              0  2017     Sheng Li 0001          Yun Fu 0001   
1              1  2017    Jiajun Wu 0001  Josh Tenenbaum 0001   
2              2  2017      Rong Ge 0001            Tengyu Ma   
3              3  2017    Ziyu Wang 0001        Nicolas Heess   
4              4  2017      Ping Li 0001       Martin Slawski   
...          ...   ...               ...                  ...   
6393        6393  2010     Yu Zhang 0006              Qian Xu   
6394        6394  2010  Hongbo Zhou 0001          Qiang Cheng   
6395        6395  2010      Jun Zhu 0001         Eric P. Xing   
6396        6396  2010  Martin Zinkevich       Lihong Li 0001   
6397        6397  2010  John D. Lafferty         Aron Culotta   

     journal/conference                                              title  \
0                  NIPS  Matching on Balanced Nonlinear Representations...   
1                  NIPS   Learning to See Physics via Visual De

### Problem 1: paper-url

At face value, we can already see an issue with the paper-url values. It appears that every paper-url except the last (the 6398th) is invalid because of the structure of the link. Unlike the functioning last link, the other paper-urls contain an additional "/hash" segment that leads to a page that cannot be found.
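
One way to check this observation across the whole column, rather than by eye, is to flag every link that contains the "/hash" segment. This is only a minimal sketch: the frame `urls` and the two `placeholder` links are made up to mimic the shapes seen in the preview, and the same `str.contains` call would need to be pointed at the real `paper-url` column.

```python
import pandas as pd

# Made-up placeholder links that only mimic the two URL shapes in the preview.
urls = pd.DataFrame({
    "paper-url": [
        "https://proceedings.neurips.cc/paper/2017/hash/placeholder-a",
        "https://proceedings.neurips.cc/paper/2017/hash/placeholder-b",
        "https://proceedings.neurips.cc/paper/2017",
    ]
})

# Flag every row whose link contains the extra "/hash" segment.
has_hash = urls["paper-url"].str.contains("/hash", regex=False)

print(has_hash.sum(), "of", len(urls), "links contain /hash")
print(urls.loc[~has_hash, "paper-url"].tolist())
```

Listing the rows where `has_hash` is False shows exactly which links lack the segment, which makes it easy to confirm whether it really is only a single row.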

In [17]:
data_Neur_IPS.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6398 entries, 0 to 6397
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          6398 non-null   int64 
 1   year                6398 non-null   int64 
 2   first-author        6398 non-null   object
 3   last-author         6398 non-null   object
 4   journal/conference  6397 non-null   object
 5   title               6398 non-null   object
 6   number-of-authors   6398 non-null   int64 
 7   paper-url           6398 non-null   object
 8   DBLP-url            6398 non-null   object
dtypes: int64(3), object(6)
memory usage: 450.0+ KB


### Missing Data

In the case of Neur-IPS, we have only one missing value across the 6398 rows. We can disregard this missing item since it belongs to "journal/conference", and we already know that the row is from Neur-IPS (confirmed by its associated URLs).
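
For the other conference files, a per-column count of missing values gives the same overview at a glance. The sketch below uses a tiny stand-in frame named `sample` (an assumption, not the real dataset), and the fill step is optional since the gap can also simply be ignored.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame; the second row mimics a row with a missing label.
sample = pd.DataFrame({
    "year": [2017, 2017],
    "journal/conference": ["NIPS", np.nan],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Optional: fill the gap with the label used by the other rows.
filled = sample.fillna({"journal/conference": "NIPS"})
print(filled)
```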

In [18]:
print(data_Neur_IPS[data_Neur_IPS.isna().any(axis=1)])

     Unnamed: 0  year    first-author    last-author journal/conference  \
679         679  2017  Isabelle Guyon  Roman Garnett                NaN   

                                                 title  number-of-authors  \
679  Advances in Neural Information Processing Syst...                  7   

                                     paper-url  \
679  https://proceedings.neurips.cc/paper/2017   

                                DBLP-url  
679  https://dblp.org/rec/conf/nips/2017  


### Filtering Data Pt. 1

We are mainly interested in a few columns, so we will drop every column other than first-author, last-author, paper-url, and DBLP-url.

In [20]:
#keep only the author and url columns, selected by name instead of by position
data_Neur_IPS_updated = data_Neur_IPS[['first-author', 'last-author', 'paper-url', 'DBLP-url']].copy()
print(data_Neur_IPS_updated.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6398 entries, 0 to 6397
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   first-author  6398 non-null   object
 1   last-author   6398 non-null   object
 2   paper-url     6398 non-null   object
 3   DBLP-url      6398 non-null   object
dtypes: object(4)
memory usage: 200.1+ KB
None


In [21]:
print(data_Neur_IPS_updated)

          first-author          last-author  \
0        Sheng Li 0001          Yun Fu 0001   
1       Jiajun Wu 0001  Josh Tenenbaum 0001   
2         Rong Ge 0001            Tengyu Ma   
3       Ziyu Wang 0001        Nicolas Heess   
4         Ping Li 0001       Martin Slawski   
...                ...                  ...   
6393     Yu Zhang 0006              Qian Xu   
6394  Hongbo Zhou 0001          Qiang Cheng   
6395      Jun Zhu 0001         Eric P. Xing   
6396  Martin Zinkevich       Lihong Li 0001   
6397  John D. Lafferty         Aron Culotta   

                                              paper-url  \
0     https://proceedings.neurips.cc/paper/2017/hash...   
1     https://proceedings.neurips.cc/paper/2017/hash...   
2     https://proceedings.neurips.cc/paper/2017/hash...   
3     https://proceedings.neurips.cc/paper/2017/hash...   
4     https://proceedings.neurips.cc/paper/2017/hash...   
...                                                 ...   
6393  https://proceedi

### Note about author labeling:

Taking just three examples from the dataset, we see that the author labels can be misleading. For example, one might initially infer that a numerical value (e.g. 0001) attached to an author's name means that the author has published multiple works in this conference. Since this data is pulled strictly from this conference, it is unlikely that these numbers count papers the author may have published in other conferences.

In the following cell, we check and print every row where the author's name is found, whether they were the first author on the paper or the last. We cover each case: an author with no numerical value attached to their name, one with a value of 0001, and one with a value greater than 0001.

The first test case, author "Martin Zinkevich", demonstrates that although he has no numerical value attached to his name, he appears in multiple papers, so the number cannot be a count of papers. It is still possible that an author's name appears between the first and last authors of other papers, which these two columns do not capture. Even if that were the case, we at least know that it doesn't influence the number next to their name.

In [22]:
#test to check author with no numerical value
print(data_Neur_IPS_updated[data_Neur_IPS_updated['first-author'].astype(str).str.contains('Martin Zinkevich')])
print(data_Neur_IPS_updated[data_Neur_IPS_updated['last-author'].astype(str).str.contains('Martin Zinkevich')])

#test to check author with numerical value of 0001
print(data_Neur_IPS_updated[data_Neur_IPS_updated['first-author'].astype(str).str.contains('Sheng Li')])
print(data_Neur_IPS_updated[data_Neur_IPS_updated['last-author'].astype(str).str.contains('Sheng Li')])

#test to check author with numerical value greater than 0001
print(data_Neur_IPS_updated[data_Neur_IPS_updated['first-author'].astype(str).str.contains('Yu Zhang')])
print(data_Neur_IPS_updated[data_Neur_IPS_updated['last-author'].astype(str).str.contains('Yu Zhang')])

          first-author     last-author  \
6396  Martin Zinkevich  Lihong Li 0001   

                                              paper-url  \
6396  https://proceedings.neurips.cc/paper/2010/hash...   

                                           DBLP-url  
6396  https://dblp.org/rec/conf/nips/ZinkevichWSL10  
         first-author       last-author  \
4876  Dale Schuurmans  Martin Zinkevich   

                                              paper-url  \
4876  https://proceedings.neurips.cc/paper/2016/hash...   

                                          DBLP-url  
4876  https://dblp.org/rec/conf/nips/SchuurmansZ16  
    first-author  last-author  \
0  Sheng Li 0001  Yun Fu 0001   

                                           paper-url  \
0  https://proceedings.neurips.cc/paper/2017/hash...   

                                 DBLP-url  
0  https://dblp.org/rec/conf/nips/0001F17  
Empty DataFrame
Columns: [first-author, last-author, paper-url, DBLP-url]
Index: []
       first-author     
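
On DBLP, a four-digit suffix of this kind is used to tell apart different people who share the same name, which fits the observation that it does not count papers. If the suffix gets in the way later (for example in a search query), it can be split off from the name. The following is a minimal sketch using three names from the preview; the column names `name` and `suffix` are assumptions chosen for illustration.

```python
import pandas as pd

# Three names taken from the preview of the dataset.
authors = pd.Series(["Sheng Li 0001", "Martin Zinkevich", "Yu Zhang 0006"])

# Capture the name and, when present, a trailing four-digit suffix.
parts = authors.str.extract(r"^(?P<name>.*?)(?:\s+(?P<suffix>\d{4}))?$")
print(parts)
```

Names without a suffix keep their full text in `name` and get a missing value in `suffix`.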

## Gender Identifier

Given the current data, we will implement a web-scraping script that will return the gender (male/female) of every author in the first-author/last-author columns of the dataset.
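
The core of such a script is a pronoun check on the text of a page, and that part can be tried offline before any scraping is involved. The sketch below is one possible version, not the notebook's final method: the function name `guess_gender_from_text` and the "unknown" label are assumptions, and it uses word-boundary matching so that pronouns next to punctuation or at the start of a sentence are still counted.

```python
import re

# Word-boundary patterns, so "She" at the start of a sentence or "her,"
# before punctuation still count as matches.
FEMALE = re.compile(r"\b(she|her|hers)\b", re.IGNORECASE)
MALE = re.compile(r"\b(he|his|him)\b", re.IGNORECASE)

def guess_gender_from_text(text):
    """Return "female", "male", or "unknown" based on pronoun counts."""
    female = len(FEMALE.findall(text))
    male = len(MALE.findall(text))
    if female > male:
        return "female"
    if male > female:
        return "male"
    # No pronouns at all, or an even split: do not guess.
    return "unknown"

print(guess_gender_from_text("She is a professor. Her lab studies vision."))
print(guess_gender_from_text("The lab studies optimization."))
```

Returning "unknown" instead of a default keeps pages without any pronouns from being counted toward either group. Pronouns on a page may also refer to someone other than the author, so any result would still need manual verification.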

In [23]:
import requests
from bs4 import BeautifulSoup

In [24]:
data_Neur_IPS_updated.head()

| | first-author | last-author | paper-url | DBLP-url |
|---|---|---|---|---|
| 0 | Sheng Li 0001 | Yun Fu 0001 | https://proceedings.neurips.cc/paper/2017/hash... | https://dblp.org/rec/conf/nips/0001F17 |
| 1 | Jiajun Wu 0001 | Josh Tenenbaum 0001 | https://proceedings.neurips.cc/paper/2017/hash... | https://dblp.org/rec/conf/nips/0001LKFT17 |
| 2 | Rong Ge 0001 | Tengyu Ma | https://proceedings.neurips.cc/paper/2017/hash... | https://dblp.org/rec/conf/nips/0001M17 |
| 3 | Ziyu Wang 0001 | Nicolas Heess | https://proceedings.neurips.cc/paper/2017/hash... | https://dblp.org/rec/conf/nips/0001MRFWH17 |
| 4 | Ping Li 0001 | Martin Slawski | https://proceedings.neurips.cc/paper/2017/hash... | https://dblp.org/rec/conf/nips/0001S17 |


In [38]:
from googlesearch import search
import pandas as pd

In [39]:
def findGender(urlLink):
    """Guess "female" or "male" from the pronouns on the page at urlLink.

    Returns "unknown" when the page cannot be fetched or has no pronouns.
    """
    ## put a url where the description of the author is available
    try:
        htmlText = requests.get(urlLink, timeout=10).text
    except requests.RequestException:
        return "unknown"
    soup = BeautifulSoup(htmlText, 'html.parser')

    # gathers all the p tags from the html
    soupstring = str(soup.find_all("p"))
    shetags = [" she ", " She ", " her ", " Her ", " hers ", " Hers "]
    hetags = [" he ", " He ", " his ", " His ", " him ", " Him "]

    ##checks if male or female pronouns are present in the description
    containsfemale = any(femalepronouns in soupstring for femalepronouns in shetags)
    containsmale = any(malepronouns in soupstring for malepronouns in hetags)

    if containsfemale:
        return "female"
    if containsmale:
        return "male"
    #no pronouns found, so do not default to either gender
    return "unknown"

In [44]:
#Questions:
#should we make it like a dictionary instead?
#are we just focused on gender breakdown or should we also record which author is which gender while we do it

def genderBreakdown():
    Neur_IPS_male_authors = 0
    Neur_IPS_female_authors = 0

    #iterates through all the authors in the dataset
    for col in ['first-author', 'last-author']:
        colObj = data_Neur_IPS_updated[col]
        print('Column Name : ', col)
        print(colObj.values)

        #goes through Google to find links related to that author's name
        lst_authors = colObj.values
        for item in lst_authors:
            query = item
            #gets around 10 links for each author, then sends them one by one to the helper function
            #and stops at the first link that determines the gender of that author
            for j in search(query, num_results=10):
                gender = findGender(j)
                if gender == "male":
                    Neur_IPS_male_authors += 1
                    break
                elif gender == "female":
                    Neur_IPS_female_authors += 1
                    break
    return Neur_IPS_male_authors, Neur_IPS_female_authors

In [43]:
#need to verify results still, but it's too slow anyways. I ended up having to close the program.
#also I need to figure out why the parameters for the search method are different in VSCode
genderBreakdown()

Column Name :  first-author
['Sheng Li 0001' 'Jiajun Wu 0001' 'Rong Ge 0001' ... 'Jun Zhu 0001'
 'Martin Zinkevich' 'John D. Lafferty']


TypeError: search() got an unexpected keyword argument 'num'
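
Since the full run was too slow to finish, one option (which also touches on the dictionary question raised in the comments) is to look up each unique author only once and keep the result in a dictionary. The sketch below is a hedged illustration: `gender_breakdown`, `classify`, and `stub_classify` are hypothetical names, and the stub stands in for the real search-and-scrape step so that the example runs offline.

```python
from collections import Counter

def gender_breakdown(authors, classify):
    """Classify each unique author once and tally the labels.

    `classify` is a hypothetical callable (name -> label) standing in for
    the search-and-scrape step.
    """
    cache = {}
    for name in authors:
        if name not in cache:
            cache[name] = classify(name)  # one lookup per unique name
    return cache, Counter(cache.values())

# Stub classifier so the sketch runs offline; its label is a placeholder.
calls = []
def stub_classify(name):
    calls.append(name)
    return "unknown"

names = ["Sheng Li 0001", "Martin Zinkevich", "Sheng Li 0001"]
per_author, totals = gender_breakdown(names, stub_classify)
print(per_author)
print(totals)
```

The dictionary records which label each author received, and the counter gives the overall breakdown. Note that this tallies unique authors rather than author appearances, which is a different quantity from the per-row counts in `genderBreakdown`.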