Note: Plaintext AI Text Searcher (PATS) is now archived, but the source code is still available to view in this repository. Plaintext Group was a nonpartisan, technology innovation policy initiative developed by Schmidt Futures. For more information, see https://www.schmidtfutures.com/our-work/plaintext/.
Plaintext AI Text Searcher (PATS) is a computational system developed to extract bibliometric metadata (institution, country, keywords) from AI research papers published at top conferences, with the goal of providing a nuanced understanding of the state of AI research globally and guiding critical discussion on AI policy.
The dataset curated by PATS is 50-100x larger than any other dataset of its kind; previous datasets consisted of only a few hundred papers analyzed manually. Our dataset encompasses every paper published between 2013 and 2019 (19,861 total) at the following top AI conferences:
- Association for the Advancement of Artificial Intelligence (AAAI),
- Conference on Computer Vision and Pattern Recognition (CVPR),
- International Conference on Learning Representations (ICLR),
- The International Conference on Machine Learning (ICML), and
- Conference on Neural Information Processing Systems (NeurIPS).
These conferences were used as a quality filter since they are the most prestigious AI conferences by h-index.
The dataset is stored in the `Data` directory and contains the following papers:
Conference | # of papers |
---|---|
AAAI | 5,280 |
CVPR | 5,296 |
ICLR | 1,419 |
ICML | 3,008 |
NeurIPS | 4,858 |
Total | 19,861 |
The dataset is scraped from multiple conference websites, and the scraping code is available in the `PaperScraper` directory. The main entrypoint for scraping, which contains all the links we scrape from, is `PaperScraper/Main.py`.
- PATS Set-Up Instructions and Technical Details
- PATS API Overview
- PATS Policy Questions and Analysis Introduction
- PATS Limitations
Make sure you have Python 3.3+ installed, and either follow the Conda or Virtual Environment Setup outlined below.
Conda
First, make sure you’ve installed all of the requirements. We would recommend setting up your own Conda environment (installation instructions), so that you can manage your own dependencies without interfering with any of your other existing packages. You can do this by entering the following in a shell:
conda create -n pats python=3.7 anaconda
Activate the environment using:
conda activate pats
Next, make sure to clone the repo and install requirements here using
git clone https://github.com/PlaintextGroup/pats
cd pats
pip install -r requirements.txt
That’s it, you’re ready to start making some queries using PATS!
Virtual Environment
Make sure to clone the repo using:
git clone https://github.com/PlaintextGroup/pats
cd pats
Now, make your virtualenv. In this example, we’ve called it venv. Then install requirements:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Then you should be good to go!
As a summary, start in the `PATS/` directory, then:
- Fill in the required fields within `config.py`.
- Select whether you want a plot or not. If you do, make sure to set `plot_data` to True. If you don’t, the required data will be displayed on the Terminal as plaintext output.
- Select whether you want an exported CSV. If you do, make sure to set `export_csv` to True.
- Run `python analyzer.py`.
- Graphs will be generated if chosen. To re-run, exit out of the graphs and then repeat!
We’ll take a deeper dive analyzing these steps and each of the filters below.
First, let’s take a deeper dive into the code to understand exactly what is going on. We’ll go over the relative file structure first to give a better sense of how everything is organized. If we look at the current folder in `pats/`, we see a mixture of Python scripts, Python notebooks, and folders. We’ll go through the most important ones, which are `analyzer.py` and `config.py`.
analyzer.py
This is the primary file that drives much of the analysis. It compiles all the documents, then examines the configurations that have been enabled and generates the specific graphs or outputs that have been specified. You will NOT have to touch this file to run PATS.
config.py
This is the main file you will edit when interacting with PATS. You will be changing various fields within this in order to create the proper query. This file consists of multiple JSON-like dictionary objects in which various fields are stored. Each field, explained below, can be configured differently in order to get the proper chart that we want.
To start off, we have several filters to narrow down the amount of data that we want analyzed. These are divided between input and output filters. Your input filters will filter what data you want analyzed. Your output filters will filter the format and style of what is output, most commonly either a graph or a csv.
country_analyzed
- Comma separated list of strings specifying which countries are to be compared. This limits outputs to have at least one contributing author from the chosen country/countries (doesn’t ensure that all authors are from the same country). For analysis purposes, if a paper has multinational associations, it will count as a paper with both countries of origin. For example, if a paper has country affiliations of both France and the US, it will be considered both a French and American paper. This filter will be applied across all subsequent queries where a country is mentioned. Leave this empty if you want papers from all countries analyzed.
Example 1: I want to analyze all papers where at least one contributing author is from Germany.
Input 1: "country_analyzed": ['Germany']
Example 2: I want to analyze all papers where at least one contributing author is either from the US, China, or Germany.
Input 2: "country_analyzed": ['US', 'China', 'Germany']
Example 3: I want to analyze all papers, regardless of author country affiliation (all countries analyzed).
Input 3: "country_analyzed": []
A unique list of all countries that can be analyzed/placed in this list is available at country_options.csv.
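To make the "at least one contributing author" rule concrete, here is a minimal sketch in plain Python. The record layout (a `countries` set per paper) and the function names are assumptions made for illustration; PATS itself works on a dataframe built from the scraped papers.

```python
# Hypothetical records: each paper carries the set of its authors' countries.
papers = [
    {"title": "Paper A", "countries": {"France", "US"}},
    {"title": "Paper B", "countries": {"Germany"}},
    {"title": "Paper C", "countries": {"China", "US"}},
]


def filter_by_country(papers, country_analyzed):
    """Keep papers with at least one affiliation in country_analyzed.

    An empty list means "no filtering", as described for this field.
    """
    if not country_analyzed:
        return list(papers)
    wanted = set(country_analyzed)
    return [paper for paper in papers if paper["countries"] & wanted]


def count_by_country(papers, country_analyzed):
    """Count papers per selected country; a multinational paper counts for each."""
    return {
        country: sum(1 for paper in papers if country in paper["countries"])
        for country in country_analyzed
    }


selected = filter_by_country(papers, ["US", "France"])
print([paper["title"] for paper in selected])
print(count_by_country(selected, ["US", "France"]))
```

Note that "Paper A" contributes to both the US and the French counts, so per-country totals can add up to more than the number of papers.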
years_analyzed
- Comma separated list of integers specifying which years are to be compared. Will be applied across all subsequent queries where a year is mentioned. Leave empty if you want papers from all years analyzed (2013 - 2019).
Example 1: I want to analyze all papers from 2013.
Input 1: "years_analyzed": [2013]
Example 2: I want to analyze all papers coming from 2013 - 2015.
Input 2: "years_analyzed": [2013, 2014, 2015]
Example 3: I want to analyze all papers from any year in my dataset (2013 - 2019).
Input 3: "years_analyzed": []
A unique list of all years that can be analyzed/placed in this list is available at year_options.csv.
institution_analyzed
- Comma separated list of strings specifying which institutions are to be compared. Similar to other fields, this filters by papers where there is at least one contributing author from the chosen institution (doesn’t ensure that all authors are from the same institution). For analysis purposes, if a paper has multi-institutional associations, it will count as a paper published by all institutions. For example, if a paper has institutional affiliations of both Facebook and Google, it will be considered a paper from both Facebook and Google. This filter will be applied across all subsequent queries where institutions are mentioned. Leave empty if you want papers from all institutions analyzed.
Example 1: I want to analyze all papers where at least one contributing author is from Google.
Input 1: "institution_analyzed": ['Google']
Example 2: I want to analyze all papers where at least one contributing author is either from Google, SenseTime Group Ltd, or the Massachusetts Institute of Technology.
Input 2:
"institution_analyzed": ['Google', 'SenseTime Group Ltd.', 'Massachusetts Institute of Technology']
Example 3: I want to analyze all papers, regardless of author institutional affiliation (all institutions analyzed).
Input 3: "institution_analyzed": []
A unique list of all institutions that can be analyzed/placed in this list is available at institution_options.csv.
institution_type_analyzed
- Comma separated list of strings specifying which institution types are to be compared. Similar to other fields, this filters by papers where there is at least one affiliated institution of a chosen type (doesn’t ensure that all institutions are of the same institution type). For analysis purposes, if a paper has affiliations with multiple institution types, it will count as a paper published by all of those institution types. For example, if a paper has institutional affiliations of both Company and Academia, it will be considered a paper published by both company and academia. This filter will be applied across all subsequent queries where institutions are mentioned. Leave empty if you want papers from all institution types analyzed.
Example 1: I want to analyze all papers where at least one contributing author is from Academia.
Input 1: "institution_type_analyzed": ['Academia']
Example 2: I want to analyze all papers where at least one contributing author is either from Government, Academia, or an Independent Researcher.
Input 2:
"institution_type_analyzed": ['Government', 'Academia', 'Independent Research']
Example 3: I want to analyze all papers, regardless of author institutional type affiliation (all institution types analyzed).
Input 3: "institution_type_analyzed": []
A unique list of all institution types that can be analyzed/placed in this list is available at institution_type_options.csv.
institution_type_dependent
- A boolean indicating whether institution types should be paired with the selected countries. When it is False, the types listed in `institution_type_analyzed` are applied to every listed country. When it is True, each country is instead filtered by the institution type at the same position in `institution_type_analyzed`, which allows for greater precision. This requires `country_analyzed` and `institution_type_analyzed` to have the same length; otherwise the program will exit. `institution_type_analyzed` must also be specified in the order corresponding to the countries being analyzed.
Example 1: I want to analyze all US academic papers vs Chinese company papers.
Input 1:
"country_analyzed": ['US', 'China'],
"institution_type_analyzed": ['Academia', 'Company'],
"institution_type_dependent": True
Example 2: I want to analyze US academic papers, Chinese Government papers, and Brazilian Company Papers.
Input 2:
"country_analyzed": ['US', 'China', 'Brazil'],
"institution_type_analyzed": ['Academia', 'Government', 'Company'],
"institution_type_dependent": True
Example 3: I want to analyze academic and company papers from both the US and China.
Input 3:
"country_analyzed": ['US', 'China'],
"institution_type_analyzed": ['Company', 'Academia'],
"institution_type_dependent": False
A unique list of all institution types that can be analyzed/placed in this list is available at institution_type_options.csv.
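The pairing behaviour described above can be sketched as follows. This is only an illustration of one reading of the README, not the code in `analyzer.py`: the function name `build_country_type_filters` is hypothetical, the non-dependent case is interpreted as "every type applies to every country", and the sketch raises `ValueError` where the README says the program exits.

```python
from itertools import product


def build_country_type_filters(country_analyzed, institution_type_analyzed, dependent):
    """Return the (country, institution type) pairs that a query would cover.

    Hypothetical helper: it illustrates the pairing rule, not PATS internals.
    """
    if dependent:
        # Dependent mode pairs entries by position, so the lengths must match.
        if len(country_analyzed) != len(institution_type_analyzed):
            raise ValueError(
                "country_analyzed and institution_type_analyzed must have the same length"
            )
        return list(zip(country_analyzed, institution_type_analyzed))
    # Independent mode: every institution type applies to every country.
    return list(product(country_analyzed, institution_type_analyzed))


# Example 1 from above: US academic papers vs Chinese company papers.
print(build_country_type_filters(["US", "China"], ["Academia", "Company"], True))

# Example 3 from above: academic and company papers from both countries.
print(build_country_type_filters(["US", "China"], ["Company", "Academia"], False))
```

In dependent mode the order of both lists matters, which is why the README asks for `institution_type_analyzed` to be given in the same order as the countries.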
conferences_analyzed
- Comma separated list of lowercase strings specifying which conferences are to be compared. Will be applied across all subsequent queries where conferences are mentioned. Leave empty if you want papers from all conferences analyzed.
Example 1: I want to analyze all papers from ICML.
Input 1: "conferences_analyzed": ['icml']
Example 2: I want to analyze all papers coming from ICML, CVPR, and NeurIPS.
Input 2: "conferences_analyzed": ['icml', 'cvpr', 'neurips']
Example 3: I want to analyze all papers from all conferences.
Input 3: "conferences_analyzed": []
A unique list of all conferences that can be analyzed/placed in this list is available at conference_options.csv.
keywords/min_keyword_count
- A list of keyword strings. PATS searches the paper text using exact string matching and returns the papers that contain all of the specified keywords. The minimum keyword count is set in `min_keyword_count` and defaults to 5, meaning that a keyword must appear at least 5 times for a paper to be kept in the final dataframe.
Example 1: I want to analyze all papers that mention the word 'education' at least 5 times.
Input 1:
"keywords": ['education'],
"min_keyword_count": 5
Example 2: I want to analyze all papers that mention the words 'education', 'writing', AND 'essays' at least 10 times.
Input 2:
"keywords": ['education', 'writing', 'essays'],
"min_keyword_count": 10
Example 3: I want to analyze all papers without specifying for any keywords.
Input 3:
"keywords": [],
"min_keyword_count": 5 # default value
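A minimal sketch of the keyword rule, under two assumptions that the README does not spell out: that the threshold applies to each keyword separately, and that matching is case-sensitive. The function name `matches_keywords` is hypothetical.

```python
def matches_keywords(text, keywords, min_keyword_count=5):
    """Return True if every keyword occurs at least min_keyword_count times.

    Uses exact, case-sensitive substring matching via str.count. Whether PATS
    normalises case before matching is not stated, so treat that as an assumption.
    """
    return all(text.count(keyword) >= min_keyword_count for keyword in keywords)


sample_text = "education " * 5 + "writing " * 2

print(matches_keywords(sample_text, ["education"]))             # 'education' appears 5 times
print(matches_keywords(sample_text, ["education", "writing"]))  # 'writing' appears only twice
print(matches_keywords(sample_text, []))                        # no keywords: nothing is filtered out
```

Because `all()` over an empty list is true, an empty `keywords` list keeps every paper, which matches Example 3.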
n_plotted
- Default number of items plotted when no other filters are specified. For example, if all countries are analyzed and a graph is generated, only the top n_plotted lines will show up on the graphs in order to reduce the amount of clutter and highlight the most important data. This value is defaulted to 5.
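The "top n_plotted" selection can be sketched with `collections.Counter`. The input layout (one list of countries per paper) and the function name are assumptions for illustration only.

```python
from collections import Counter


def top_n_countries(paper_countries, n_plotted=5):
    """Return the n_plotted countries with the most papers, most frequent first."""
    counts = Counter()
    for countries in paper_countries:
        # set() ensures a paper with several authors from one country counts once.
        counts.update(set(countries))
    return counts.most_common(n_plotted)


paper_countries = [
    ["US", "China"],
    ["US"],
    ["US", "Germany"],
    ["China"],
]
print(top_n_countries(paper_countries, n_plotted=2))
```

With `n_plotted=2`, only the US and China lines would be drawn in this toy example; Germany is dropped to reduce clutter.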
Default filters will generate a textual summary of the data in terms of paper titles, which will be displayed in the Terminal. You may configure the following output filters to either generate an output CSV file (with the printed data on the Terminal) or a plot.
plot_data
- Given the filters selected above, this will generate TWO graphs (Figure 1 and Figure 2) showing the number of papers published per year, both by the selected country and by institution. For example, Figure 1 will be your graph by country, and Figure 2 will show the top institutions for the country/countries selected.
Example 1: I want to plot out all papers where at least one contributing author is from Google from years 2013 - 2019.
Input 1:
"institution_analyzed": ['Google'],
"years_analyzed": [],
"plot_data": True
Example 2: I want to plot out all papers where at least one contributing author is from the US or China from years 2013 - 2019.
Input 2:
"country_analyzed": ['US', 'China'],
"years_analyzed": [],
"plot_data": True
Example 3: I only want papers where at least one contributing author is from Google from years 2013 - 2019.
Input 3:
"institution_analyzed": ['Google'],
"years_analyzed": [],
"plot_data": False
export_csv
- Exports a CSV of the final dataframe containing the title and year of each paper. This is the same data that would be plotted, exported as a CSV instead. To read the full text of a paper, simply search for its title. CSV exports do not contain full text in order to conserve memory.
Example 1: I want to export a CSV with all papers where at least one contributing author is from Google from years 2013 - 2019.
Input 1:
"institution_analyzed": ['Google'],
"years_analyzed": [],
"export_csv": True
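As a rough illustration of a title-and-year export that leaves out the full text, the sketch below uses Python's standard `csv` module. The column names `title` and `year`, the record layout, and the function name are assumptions; the actual export in PATS may differ.

```python
import csv
import io


def export_titles_and_years(papers, handle):
    """Write only the title and year of each paper to an open text handle."""
    # extrasaction="ignore" drops other keys, such as the full paper text.
    writer = csv.DictWriter(handle, fieldnames=["title", "year"], extrasaction="ignore")
    writer.writeheader()
    writer.writerows(papers)


papers = [
    {"title": "Paper A", "year": 2015, "text": "full text of paper A"},
    {"title": "Paper B", "year": 2019, "text": "full text of paper B"},
]

buffer = io.StringIO()  # stands in for a file opened with open(path, "w", newline="")
export_titles_and_years(papers, buffer)
print(buffer.getvalue())
```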
The summary analyses are preset summaries that can be printed if you are curious about additional summary statistics.
analyze_coauthors
- Graphs the number of co-authors per paper from the filtered dataframe. For example, if a paper has two authors, then the number of papers with 2 authors will go up by 1. If there is 1 author, then the number of papers with 1 author will be incremented by 1.
top_authors
- Displays the top authors from the filtered dataframe, sorted in descending order starting with the most publications over the selected year range.
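Both summaries boil down to simple counting. The sketch below shows the idea with `collections.Counter`; the record layout, author names, and function names are made up for illustration and are not taken from `analyzer.py`.

```python
from collections import Counter

# Hypothetical filtered records with made-up author names.
papers = [
    {"title": "Paper A", "authors": ["Ada", "Grace"]},
    {"title": "Paper B", "authors": ["Ada"]},
    {"title": "Paper C", "authors": ["Ada", "Grace", "Alan"]},
    {"title": "Paper D", "authors": ["Ada", "Grace"]},
]


def coauthor_histogram(papers):
    """Map 'number of authors on a paper' to 'number of papers with that many'."""
    return Counter(len(paper["authors"]) for paper in papers)


def rank_authors(papers, n=3):
    """Return the n authors with the most papers, most prolific first."""
    counts = Counter(author for paper in papers for author in paper["authors"])
    return counts.most_common(n)


print(dict(coauthor_histogram(papers)))
print(rank_authors(papers))
```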
verbose
- Enables all detailed print statements when set to True. Also prints summary statistics by conference and other miscellaneous summaries.
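Putting the fields above together, a configuration might look roughly like the dictionary below. The field names come from this README, but the dictionary name `example_config`, the chosen values, and the `check_config` helper are hypothetical; the real `config.py` may organise its dictionaries differently.

```python
# Illustrative values only; field names follow the descriptions above.
example_config = {
    "country_analyzed": ["US", "China"],
    "years_analyzed": [],  # empty list: all years (2013 - 2019)
    "institution_analyzed": [],
    "institution_type_analyzed": ["Academia", "Company"],
    "institution_type_dependent": True,
    "conferences_analyzed": ["icml", "neurips"],
    "keywords": [],
    "min_keyword_count": 5,
    "n_plotted": 5,
    "plot_data": False,
    "export_csv": True,
    "verbose": False,
}


def check_config(config):
    """Return a list of human-readable problems found in a config dictionary."""
    problems = []
    for year in config["years_analyzed"]:
        # The dataset described above covers 2013 through 2019.
        if not 2013 <= year <= 2019:
            problems.append(f"year {year} is outside the dataset range 2013-2019")
    for conference in config["conferences_analyzed"]:
        # Conference names are documented as lowercase strings.
        if conference != conference.lower():
            problems.append(f"conference '{conference}' should be lowercase")
    return problems


print(check_config(example_config))
```

A small check like this can catch common mistakes, such as writing `'ICML'` instead of `'icml'`, before running `python analyzer.py`.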
Now that you’re familiar with how to set up PATS, let’s revisit our main goal of bolstering tech policy findings. The data provided by PATS is an invaluable starting point for guiding critical discussion on AI policy.
In this section, we outline a potential policy framework you may find useful to help guide discussions.
Overarching Question: What is the main policy concept we want to explore?
Sub-Question: Are there further subdivisions of the Overarching Question?
- Main Question(s): What are the main questions within this topic area?
- Customer: Who is a (potential) customer?
- PATS Query: What is the query that needs to be inputted into PATS to answer the question?
- Other potential use cases: Other questions/queries that can be answered using PATS.
Overarching Question(s): As a funder, how can I identify top AI Research Talent in a more effective and efficient manner than just simply using our networks and press?
Sub-Question: Can I identify top AI Researchers by topic?
- Main Question(s): Who are the top AI researchers working on `[insert topic]`? What are potential strategies to involve them in our talent programs?
- Customer: Large Funding Agency (NSF, NASA, NIH), Philanthropic Funders (Schmidt Futures, CZI), Press (NYT, Politico)
- PATS Query: Filter by `[YOUR_TOPIC_HERE]`, and then rank authors in descending order. You can also filter how many times you want the word to appear.
Example:
"keywords": ['YOUR_TOPIC_HERE'],
"min_keyword_count": 5, # default
"export_csv": True
- Other potential use cases: Finding nominators for talent programs, inviting potentially lesser-known academics to competitions, sourcing of domain experts for opinions and dialogue, grant making use (who are the top performers?), policy making (who is contributing on a global stage? Potential interview to bring new insights to different AI commissions).
Overarching Question(s): How much international academic collaboration is occurring in AI? How much does the US work with China?
Sub-Question: How much do other countries work together?
- Main Question(s): How much is the US collaborating with other countries?
- Customer: Press, AI Commissions, other interested parties
- PATS Query: Filter papers by academic institutions and ‘US’
Example:
"country_analyzed": ['US'],
"institution_type_analyzed": ['Academia'],
"export_csv": True
- Other potential use cases: Recruitment internal/external talent programs both internationally and domestically.
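Once the US papers have been selected, the collaboration question reduces to counting which other countries appear on the same papers. The sketch below illustrates that counting step; it assumes each paper's country affiliations are available as a set, and the function name `collaboration_counts` is hypothetical.

```python
from collections import Counter


def collaboration_counts(paper_countries, home_country="US"):
    """Count, per partner country, the papers co-authored with home_country."""
    counts = Counter()
    for countries in paper_countries:
        unique = set(countries)
        if home_country in unique:
            # Every other country on the paper is a collaborator.
            counts.update(unique - {home_country})
    return counts


paper_countries = [
    {"US", "China"},
    {"US"},
    {"US", "China", "Germany"},
    {"France", "Germany"},
]
print(collaboration_counts(paper_countries).most_common())
```

Papers with only US affiliations add nothing to the counts, and papers without any US affiliation are ignored entirely.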
While PATS performs well on its intended tasks, certain limitations remain that point to future improvements. More specifically, we are limited by research discoverability and conference weighting, among others.
While we have created a large corpus of papers, we are primarily limited by the number of conferences that we currently scrape. We would need to introduce new scripts to scrape other, more specific conferences in order to get more domain-targeted papers, such as conferences dedicated to a chosen topic (e.g., the Association for Computational Linguistics for AI + Education). Lastly, since many of these top AI conferences are US based, there is inherent bias toward US authors. This can be mitigated by analyzing more conferences.
In addition, we struggle with gauging research out of the public eye, such as private company research. While we have tried to expand PATS to account for this, it is hard to find high-integrity qualitative research indicators outside of academic conferences. Including pre-prints and abstracts may increase the amount of material analyzed, but this subset of writing is often of lower quality since it has not been peer-reviewed. As a result, this could decrease the signal-to-noise ratio in terms of quality publications, but it could be an avenue to explore in order to better understand the other types of research happening.
Because the accepted papers into these conferences are typically associated with fundamental research, PATS does not allow prediction on which research methods mentioned in these papers will eventually lead to more impactful real-world applications.
Currently, all conferences are given equal weighting, although some researchers may prioritize some over others. For example, NeurIPS and ICML are historically regarded as the premier academic conferences for submission, while others, such as AAAI and CVPR, act as niche conferences with specific focuses within AI and Machine Learning that are typically less prestigious. Assigning equal weighting to these conferences can sometimes diminish the value of publications, but this can be alleviated by looking specifically at certain conferences.
PATS currently doesn’t support analysis of authorship ordering. Since academic papers often list authors in order of contribution, it can be important to track authorship order to determine which authors are contributing the most. However, one nuance to note is that principal investigators (PIs) are often listed at the end of the author list.
Our institutional tagging is currently limited by spreadsheet entries that map institutions to their countries. These entries need to be maintained; otherwise, tagging will not work as well for newer conferences.